PDF → HTML Converter, Free
Extract text from PDF documents and convert it to clean, semantic HTML. Preview instantly, then download or copy the code.
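About PDF → HTML conversion
This tool uses PDF.js to extract text from PDF files and render it as semantic HTML. It is well suited to converting documents into a web-compatible format, archiving content, or preparing text for further processing.
Common uses
- Document archiving · Convert PDFs to HTML for long-term digital preservation and web access.
- Content migration · Extract text from PDFs as structured HTML for a CMS or for web publishing.
- Text extraction · Get clean text out of a PDF for analysis or reuse.
- Web publishing · Convert documents into a web-compatible format that loads quickly and is more accessible.
- Data processing · Prepare PDF content for further conversion or integration.
Frequently asked questions
How large a PDF is supported?
This tool can handle PDFs up to roughly 10 MB, depending on your browser. Very large or complex PDFs may take longer to process.
Does it preserve the PDF's formatting?
The tool extracts the text content and renders it as paragraphs. Complex layouts, images, and styling are simplified into clean HTML.
Can I download the HTML?
Yes. Click "Download HTML" to save the converted content as an .html file that opens in any browser or editor.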
A short history of PDF, from PostScript to a portable page
The Portable Document Format was the brainchild of John Warnock, the co-founder of Adobe Systems, who had previously co-invented PostScript, the page-description language that, beginning in 1985 with the Apple LaserWriter, made desktop publishing possible. PostScript was extraordinarily powerful but it was a programming language, not a document format: a PostScript file described how to render a page when fed into an interpreter, but it was not really meant to be read, edited, or rendered consistently across machines that lacked the right fonts.
In 1991, Warnock circulated an internal Adobe memo that became known as the Camelot Project. The premise: Adobe should build a single file format that could capture any document (including its fonts, layout, vector graphics, and images) and reproduce it identically on any computer, on any operating system, regardless of which application originally created it. By the time the proposal had been refined, the project had a product name: Adobe Acrobat.
Acrobat 1.0 and PDF 1.0 were demonstrated at the Comdex Fall trade show in Las Vegas in November 1992 and shipped to customers in June 1993. Adobe's pivotal commercial decision came in 1994 when it began giving Acrobat Reader away for free, a move that mirrored what had happened with HTML browsers and seeded a base of millions of installations. The format went through several revisions: PDF 1.1 (1996, external links and security), 1.2 (1996, AcroForms), 1.3 (2000, digital signatures), 1.4 (2001, transparency), 1.5 (2003, object streams and JPEG 2000), 1.6 (2004, OpenType and 3D), and 1.7 (2006). In 2008, PDF 1.7 was published as ISO 32000-1:2008; PDF 2.0 followed as ISO 32000-2:2017, with a substantially revised second edition (ISO 32000-2:2020) that incorporated errata and is the current authoritative reference.
Several specialised PDF profiles exist alongside the main standard: PDF/A (ISO 19005-1:2005, with -2 in 2011 and -3 in 2012) is the archival profile that prohibits features which depend on external resources or future software (no JavaScript, no audio, no encryption, fonts must be embedded); PDF/X is the prepress profile used by the printing industry; PDF/E is the engineering profile for technical drawings; PDF/UA (ISO 14289-1:2014) is the universal-accessibility profile that requires logical structure and tagging so screen readers can present content in reading order; PDF/VT is the variable-and-transactional profile used in personalised mail merge.
Why PDF-to-HTML is structurally hard
A PDF is, at its lowest level, a collection of numbered objects (dictionaries, arrays, strings, numbers, names, and binary streams) laid out with a cross-reference table at the end that lets a reader jump to any object without parsing the whole file. The objects form a tree rooted in a Catalog object that points to a Pages tree, which contains Page objects. Each Page references a content stream: the actual instructions that draw the page.
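As a rough illustration of the "jump to any object" idea, the sketch below locates the cross-reference table by reading the offset that follows the last `startxref` keyword near the end of the file. The sample bytes are a hand-written, deliberately incomplete stand-in for a real PDF, and the 1024-byte tail window is an assumption made for the example, not a rule taken from this article.

```python
import re

def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Assumption for this sketch: the keyword sits within the last 1024 bytes.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])

# Hand-written toy file (illustrative only, not a complete valid PDF).
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
xref_offset = len(body)  # the xref section starts right after the body
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Root 1 0 R /Size 2 >>\n"
    + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
)

offset = find_startxref(sample)
# The bytes at that offset should begin the cross-reference section.
starts_at_xref = sample[offset:offset + 4] == b"xref"
```

A real reader would go on to parse the table (or a cross-reference stream in newer files) and follow the trailer's /Root entry to the Catalog.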
A content stream is a sequence of compact graphical operators in a small language related to PostScript but not Turing-complete. For text, it uses BT (begin text), Tf (set font and size), Td (move text position), Tm (set text matrix), Tj (show a string), TJ (show an array of strings with optional individual character offsets for kerning), and ET (end text). The crucial point is that everything is positional. A paragraph of body text is not stored as a paragraph. It is stored as a series of Tj or TJ commands, each one drawing a glyph or a short run of glyphs at a specific x and y coordinate on the page. There is no notion of a sentence, a paragraph, a heading, a list, or a column, only the question of where each character physically sits.
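To make the positional nature concrete, here is a minimal sketch that walks a simplified content stream and records where each Tj or TJ run starts. It is an assumption-laden toy: it handles only BT, Tf, Td, Tj and TJ, ignores Tm, string escapes, nested parentheses and hex strings, and the function name `text_runs` is invented for this example.

```python
import re

# Literal strings, array brackets, then names/numbers/operators.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|\[|\]|/?[^\s\[\]()]+")

def is_number(tok: str) -> bool:
    try:
        float(tok)
        return True
    except ValueError:
        return False

def text_runs(stream: str):
    """Return (x, y, font, size, text) for each Tj/TJ in a simplified stream."""
    operands, runs = [], []
    x = y = 0.0
    font, size = None, 0.0
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/[]" or is_number(tok):
            operands.append(tok)      # operands precede their operator
            continue
        if tok == "BT":
            x = y = 0.0               # a text object starts at the origin
        elif tok == "Tf":
            font, size = operands[-2][1:], float(operands[-1])
        elif tok == "Td":
            x += float(operands[-2])  # offset from the start of the current line
            y += float(operands[-1])
        elif tok == "Tj":
            runs.append((x, y, font, size, operands[-1][1:-1]))
        elif tok == "TJ":
            # Keep the strings, drop the numeric kerning adjustments.
            text = "".join(o[1:-1] for o in operands if o.startswith("("))
            runs.append((x, y, font, size, text))
        operands.clear()
    return runs

stream = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
[(Wor) -20 (ld)] TJ
ET
"""
runs = text_runs(stream)
```

Note what the result contains: two coordinates and two strings, with nothing that says whether they form one paragraph or two.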
HTML is the inverse: a flowing tree of semantic elements where layout is the renderer's responsibility and the same HTML can reflow to fit a phone, a desktop, or a screen reader. Converting PDF to HTML therefore requires reverse-engineering a structure the PDF was never required to record. A converter has to look at the spatial distribution of text on each page and infer:
- which characters belong to the same word (by measuring the gap between glyphs against the font's average advance width);
- which words belong to the same line (by clustering on the y-coordinate within a tolerance);
- which lines belong to the same paragraph (by comparing the spacing between successive baselines and detecting indentation patterns);
- which paragraphs belong to the same column (by finding vertical gutters of whitespace);
- and the reading order across multiple columns, footnotes, and side notes.
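The first two inferences in the list above can be sketched as follows. This is one simple heuristic, not the method of any particular converter: the `Glyph` type, the 2-unit baseline tolerance, and the 0.3 gap ratio are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge
    y: float      # baseline
    width: float  # advance width

def group_lines(glyphs, y_tol=2.0):
    """Cluster glyphs into lines by baseline, top of page first."""
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]

def line_to_text(line, gap_ratio=0.3):
    """Insert a space where the gap exceeds a fraction of the mean advance."""
    avg = sum(g.width for g in line) / len(line)
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)

# Glyphs deliberately listed out of reading order.
glyphs = [
    Glyph("k", 6, 686.0, 6), Glyph("y", 16, 700, 6), Glyph("H", 0, 700, 6),
    Glyph("o", 0, 686.5, 6), Glyph("o", 22, 700, 6), Glyph("i", 6, 700, 6),
]
text_lines = [line_to_text(line) for line in group_lines(glyphs)]
```

Real documents need more care (superscripts, rotated text, justified lines with stretched spacing), which is where fixed thresholds like these start to fail.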
None of this is solved by reading the content stream in order, because the order in the stream is not necessarily the reading order. A layout engine that produced the PDF may have drawn elements in the order most efficient for rendering, which can be top-to-bottom in zig-zag, or by font, or by colour. The text of a single paragraph can be interleaved with text from neighbouring paragraphs in the underlying stream. This is why PDF extractors that simply concatenate strings in stream order produce mangled output for anything more complex than a single-column novel.
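A small sketch of why stream order cannot be trusted, and how a gutter-based heuristic can recover reading order for a simple two-column page. The line dictionaries, the 15-unit minimum gutter width, and both function names are assumptions made for this illustration; it handles at most one gutter.

```python
def find_gutter(lines, min_width=15.0):
    """Return the x centre of the widest empty vertical band, or None."""
    if not lines:
        return None
    spans = sorted((l["x0"], l["x1"]) for l in lines)
    best = None
    reach = spans[0][1]  # right-most x covered so far
    for x0, x1 in spans[1:]:
        gap = x0 - reach
        if gap >= min_width and (best is None or gap > best[1] - best[0]):
            best = (reach, x0)
        reach = max(reach, x1)
    return None if best is None else (best[0] + best[1]) / 2

def reading_order(lines):
    """Left column top to bottom, then right column top to bottom."""
    gutter = find_gutter(lines)
    if gutter is None:
        return sorted(lines, key=lambda l: -l["y"])
    return sorted(lines, key=lambda l: (l["x0"] > gutter, -l["y"]))

# Stream order interleaves the two columns row by row.
stream_order = [
    {"text": "A1", "x0": 50, "x1": 280, "y": 700},
    {"text": "B1", "x0": 320, "x1": 550, "y": 700},
    {"text": "A2", "x0": 50, "x1": 270, "y": 686},
    {"text": "B2", "x0": 320, "x1": 540, "y": 686},
]
naive = [l["text"] for l in stream_order]
ordered = [l["text"] for l in reading_order(stream_order)]
```

A full-width heading or a footnote that spans both columns would close the gutter in this sketch, which is one reason complex layouts still come out scrambled.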
If the PDF has been tagged, that is, if its author included a structure tree alongside the visual content, the job becomes far easier. A tagged PDF includes a hierarchy of structure elements (P for paragraph, H1 through H6 for headings, L for list, LI for list item, Table, TR, TD, Figure, Caption) that mirror HTML's semantic vocabulary. PDF/UA mandates tagging for accessibility precisely because untagged PDFs are essentially opaque to assistive technology. In practice, however, the majority of PDFs in the wild are not tagged, or are tagged badly by the authoring tool, so a robust converter has to fall back to layout analysis even when tags are present.
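When a structure tree is available, the mapping to HTML can be close to a table lookup. The sketch below assumes the tree has already been read into nested `(type, children)` tuples, which is a made-up representation for this example. Mapping L to `ul` and Caption to `figcaption` are simplifying assumptions, since a PDF list may be numbered and a caption may belong to a table.

```python
from html import escape

# Structure element type -> HTML tag (an illustrative, incomplete mapping).
TAG_MAP = {
    "P": "p", "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TD": "td",
    "Figure": "figure", "Caption": "figcaption",
    **{f"H{n}": f"h{n}" for n in range(1, 7)},
}

def render(node) -> str:
    """Render a (type, children) tuple or a text string as HTML."""
    if isinstance(node, str):
        return escape(node)
    kind, children = node
    tag = TAG_MAP.get(kind, "div")  # unknown types fall back to a div
    inner = "".join(render(child) for child in children)
    return f"<{tag}>{inner}</{tag}>"

tree = [
    ("H1", ["Title"]),
    ("P", ["A & B"]),
    ("L", [("LI", ["one"])]),
]
html_out = "".join(render(node) for node in tree)
```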
The major open-source PDF rendering libraries
PDF.js is the JavaScript library written by Mozilla, originally launched in June 2011 as an experimental project led by Andreas Gal. It parses and renders PDFs entirely in the browser using HTML5 canvas and JavaScript, with no native plugin required. PDF.js was bundled into Firefox as the default PDF viewer beginning with Firefox 19 in February 2013, replacing the Adobe Reader plugin. It exposes a JavaScript API that lets a page extract text content with positional metadata (each text run comes back with its string, a transform matrix that encodes its position and scale, its width and height, and a font name). This tool is built on PDF.js.
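The positional metadata is easiest to use once it is reduced to plain coordinates. The sketch below assumes a text item shaped like the ones PDF.js is understood to return from `getTextContent` (fields `str`, `transform`, `width`, `height`, `fontName`), re-expressed as a Python dictionary. The field names and the idea of reading font size from the matrix scale are assumptions to verify against the PDF.js version in use.

```python
import math

def describe_item(item: dict) -> dict:
    """Reduce a text item to position, approximate font size, and text."""
    # Six-number affine matrix [a, b, c, d, e, f]; e and f are the translation.
    a, b, c, d, e, f = item["transform"]
    return {
        "text": item["str"],
        "x": e,
        "y": f,
        # Vertical scale of the matrix, used here as an approximate font size.
        "font_size": math.hypot(c, d),
        "font": item["fontName"],
    }

# Hypothetical item for unrotated 12-unit text at (72, 720).
item = {
    "str": "Hello",
    "transform": [12, 0, 0, 12, 72, 720],
    "width": 27.3,
    "height": 12,
    "fontName": "g_d0_f1",
}
info = describe_item(item)
```

The y value is measured from the bottom of the page, so a converter that sorts runs top to bottom sorts by descending y.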
Poppler is a C++ library forked from xpdf, the venerable PDF viewer Glyph and Cog has maintained since the late 1990s. Poppler powers the PDF-rendering features of Linux desktop environments (Evince in GNOME, Okular in KDE), the pdftotext and pdftohtml command-line utilities, and many server-side PDF processing pipelines. MuPDF, by Artifex Software (the same company that maintains Ghostscript), is a smaller and faster C library targeted at embedded use. PDFium is the engine that ships inside Google Chrome and Microsoft Edge for built-in PDF viewing; it is a fork of the proprietary Foxit PDF SDK that Google and Foxit jointly open-sourced in 2014. qpdf is a C++ library and command-line tool focused on structural manipulation rather than rendering: it can decompress, encrypt, decrypt, linearise, and rewrite PDFs without changing their visual content.
For producing HTML output specifically, the most important purpose-built project is pdf2htmlEX, originally written by Lu Wang in 2012 and now maintained by a community group. pdf2htmlEX takes a different approach from most converters: instead of trying to reconstruct semantic HTML, it reproduces the visual layout of the PDF as faithfully as possible by emitting absolutely positioned div elements for each text run, embedding the original fonts as Web Open Font Format (WOFF) files, and using CSS transforms where necessary. The result is a webpage that is visually almost indistinguishable from the original PDF, but the underlying HTML is a wall of position: absolute elements with no semantic meaning.
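The fidelity-first idea can be sketched in a few lines. This is not pdf2htmlEX's actual output format, only an illustration of the general technique. The run dictionaries, the `positioned_html` name, and the approximation of a run's top edge as baseline plus font size are assumptions for the example.

```python
from html import escape

def positioned_html(runs, page_height: float) -> str:
    """Emit one absolutely positioned div per text run."""
    parts = []
    for run in runs:
        left = run["x"]
        # PDF y grows upward from the bottom; CSS top grows downward.
        top = page_height - run["y"] - run["size"]
        size = run["size"]
        parts.append(
            f'<div style="position:absolute;left:{left:g}pt;'
            f'top:{top:g}pt;font-size:{size:g}pt">{escape(run["text"])}</div>'
        )
    return "\n".join(parts)

runs = [{"text": "Hello <PDF>", "x": 72, "y": 720, "size": 12}]
page = positioned_html(runs, page_height=792)
```

Every run is pinned to the page, which is what preserves the look and also what prevents the text from reflowing on a narrow screen.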
Layout fidelity vs semantic flow, the central trade-off
This is the central trade-off in PDF-to-HTML conversion: you can have layout fidelity or you can have semantic flow, but it is hard to have both. A fidelity-first converter like pdf2htmlEX produces output that prints and looks like the original but is opaque to a screen reader and rigid on a phone screen. A flow-first converter like pdftotext or PDF.js's getTextContent followed by simple paragraph reconstruction produces clean, readable, accessible HTML, but loses the visual richness of the source: colours, exact fonts, image placement, table grids, and any sense of the original page.
The Absolutool tool sits firmly on the flow-first side. It extracts the text content using PDF.js and emits it as paragraphs, prioritising readability, accessibility, and small file size over a pixel-perfect reproduction. If you need the visual reproduction route (every glyph in its original position, original fonts embedded, exact pagination preserved), pdf2htmlEX is the tool to look at; if you need the readable-paragraphs route (content reuse, web publishing, search-indexable HTML, screen-reader-accessible output), this tool is fit for purpose.
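For comparison, a flow-first pass might look like the sketch below: lines are merged into paragraphs wherever the vertical gap stays close to the typical line spacing, and positions are then thrown away. It is a generic illustration, not this tool's source code. The `(y, text)` input shape and the 1.5 gap factor are assumptions.

```python
from html import escape
from statistics import median

def lines_to_html(lines, gap_factor=1.5) -> str:
    """lines: (baseline_y, text) pairs ordered top to bottom."""
    if not lines:
        return ""
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * typical:
            paragraphs.append([text])    # unusually large gap: new paragraph
        else:
            paragraphs[-1].append(text)  # normal spacing: same paragraph
    return "\n".join(f"<p>{escape(' '.join(p))}</p>" for p in paragraphs)

lines = [
    (700, "PDF stores"), (686, "positions, not"), (672, "paragraphs."),
    (644, "HTML stores"), (630, "structure."),
]
flow = lines_to_html(lines)
```

The output carries no coordinates, fonts, or colours, which is exactly the trade being made.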
Embedded fonts, images, and the vector content underneath
A PDF can embed any font it likes, and a converter that wants to preserve the original look has three options. Embed-and-serve: the converter extracts each embedded font from the PDF, repackages it as a web font in a format browsers understand (WOFF or, since 2018, WOFF2 with its more aggressive Brotli compression), and links to it from the generated HTML. This preserves the original look but inflates file size and may run into licence issues if the font's embedding rights do not extend to web redistribution. Substitute: map each embedded font to a similar system font (a serif PDF font might become Times New Roman or Georgia), accepting some visual drift in exchange for a smaller, cleaner output. Ignore: discard font information entirely and let the browser apply a default body font, which is what most flow-first converters do because the user reads the HTML in the browser's normal styling.
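The substitute option can be approximated with keyword matching on the font name. The sketch below is a crude heuristic with an invented function name and hand-picked keyword lists. The one detail taken from the PDF format itself is that subset fonts carry a six-letter prefix followed by a plus sign, which is stripped before matching.

```python
import re

def css_font_stack(pdf_font_name: str) -> str:
    """Guess a CSS font-family stack from a PDF font name."""
    # Drop a subset prefix such as "ABCDEF+" if present.
    name = re.sub(r"^[A-Z]{6}\+", "", pdf_font_name).lower()
    if any(key in name for key in ("courier", "mono", "consolas")):
        return "'Courier New', monospace"
    serif_hint = any(key in name for key in ("times", "georgia", "garamond", "serif"))
    if serif_hint and "sans" not in name:
        return "Georgia, 'Times New Roman', serif"
    # Default when nothing matches: treat the font as sans-serif.
    return "Arial, Helvetica, sans-serif"

stacks = {
    name: css_font_stack(name)
    for name in ("ABCDEF+TimesNewRomanPSMT", "Courier-Bold", "OpenSans-Regular")
}
```

Name matching misclassifies fonts whose names give no hint of their style. A more careful converter could consult the font descriptor's flags instead.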
Images present a similar choice. A converter can extract embedded images as separate files and reference them from the HTML; rasterise entire pages as images and embed them inline (turning the PDF into a glorified gallery); or drop images entirely and emit text only, which is the choice this tool makes, appropriate for content reuse rather than visual reproduction. Vector content (lines, shapes, paths drawn by the PDF's graphics operators) is even more awkward, because there is no clean way to represent it in semantic HTML; converters that want to preserve it tend to fall back to inline SVG or to PNG rasterisation.
When the PDF is a picture: OCR fallback for scanned documents
A significant fraction of PDFs in the wild are not really documents in the structured sense at all: they are scanned images of paper documents, packaged in PDF wrappers because PDF is the universal format for sending paper-like things over the internet. A scanned PDF has no text operators in its content streams; each page is a single embedded raster image that happens to depict text but does not contain it as machine-readable characters. Extracting text from such a PDF requires Optical Character Recognition (OCR), which is a fundamentally different operation from text extraction.
The dominant open-source OCR engine is Tesseract, originally developed at HP Labs between 1985 and 1995, open-sourced in 2005, and maintained by Google from 2006 until Google handed primary stewardship to a community group around 2018. Tesseract supports more than a hundred languages, runs on every major platform, and powers the OCR features of countless desktop and server tools. Apple's Vision framework, available on macOS and iOS since 2017, includes a fast and accurate text-recognition API used by the OS's built-in screenshot and photo apps. Google Cloud Vision, Azure Computer Vision, and Amazon Textract are the major cloud OCR services; for documents specifically, Textract and Azure's Document Intelligence both go beyond raw OCR to recognise tables, key-value pairs, and form fields.
A browser-based PDF-to-HTML converter that runs entirely client-side cannot generally perform OCR: the OCR models are tens of megabytes at minimum and the inference is too slow to run interactively on a user's laptop. If your PDF contains scanned pages with no extractable text, this tool will produce empty output for those pages, and the right next step is a separate OCR tool or a server-side service.
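A converter can at least tell the user which pages came back empty. The sketch below flags pages whose extracted text has almost no non-whitespace characters. The 20-character threshold and the function name are assumptions for illustration, and a page that is genuinely blank would be flagged too.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers with little or no extractable text."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # strip all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged

page_texts = [
    "Annual report 2023 with plenty of text",  # born-digital page
    "",                                         # scanned page, nothing extracted
    "  \n ",                                    # whitespace only
]
needs_ocr = pages_needing_ocr(page_texts)
```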
Why people convert PDFs to HTML
The use cases fall into a handful of recurring patterns:
- Publishers with archives of legacy reports (annual reports, white papers, research papers, government publications, technical manuals) convert PDFs to HTML so the content can be read directly in a browser without forcing every visitor to download a file. The HTML versions are easier to navigate (you can link to a specific section), faster to load on a phone, and crawlable by search engines.
- Bloggers and content marketers convert PDFs they have authored or licensed into HTML so they can republish the content as articles, repurposing the text without re-typing it.
- Web archivists convert PDFs to HTML as part of preservation projects, on the theory that HTML is more durable across decades than a complex binary format whose specification is several thousand pages long.
- Mobile reading apps and read-it-later services convert PDFs to HTML so that articles open in their reader views, with the user's preferred font and font size, dark mode, and adjustable line length.
- Screen-reader users convert PDFs to HTML because untagged PDFs are essentially opaque to assistive technology, but HTML rendering with proper paragraphs and headings can be read aloud cleanly.
- Knowledge-management workflows convert PDFs to HTML to feed the content into search indexes, full-text databases, and large-language-model context windows, none of which natively understand the PDF format, but all of which handle plain text and HTML well.
- Researchers and academics extract text from PDFs of journal articles to feed into citation managers, reference databases, or text-mining pipelines that look for patterns across thousands of papers.
Each of these has slightly different requirements, but they share a common thread: the user wants the content of the PDF (the words, the structure, the meaning) without being constrained by the rigid page-bound format the original document chose.
More questions
Why does my converted HTML look different from the original PDF?
This tool is flow-first: it extracts the text and emits clean paragraphs in your browser's default font, prioritising readability, accessibility and search-indexability over visual fidelity. If you need a pixel-perfect reproduction of the original layout (embedded fonts, exact positioning, original colours) look at fidelity-first tools like pdf2htmlEX, which emits absolutely-positioned div elements that match the source PDF visually but produce HTML that is essentially unreadable to screen readers and rigid on phone screens.
Why is my multi-column PDF coming out scrambled?
PDF doesn't store reading order, only positions. A converter has to infer column boundaries from the spatial distribution of text. For simple two-column layouts the heuristic usually works; for complex layouts with side notes, footnotes that interleave with body text, or text that crosses a column gutter, it can produce out-of-order output. If you have a tagged PDF (one that includes a structure tree), accuracy is dramatically better; for untagged PDFs, the result depends on how clearly the columns are physically separated by whitespace.
My PDF is scanned (just images). Why isn't anything extracted?
A scanned PDF has no text content: each page is a raster image of text rather than text the parser can read. Extracting text requires OCR (Tesseract, Google Cloud Vision, Apple's Vision framework, etc.), which is a fundamentally different operation from PDF parsing. This tool doesn't bundle an OCR engine because the models are too large to ship with a browser tool. The right next step is a dedicated OCR service or a desktop tool with OCR built in.
Can I convert a password-protected PDF?
If the PDF has an open password (you need to type a password just to view it), PDF.js will throw an error rather than convert. If the PDF has only a permissions password (opening is unrestricted but printing or copying is restricted), behaviour varies with the PDF.js build and how it is configured, so text extraction may or may not be allowed. Either way, the cleanest path is to remove the protection in the original PDF tool first, then convert. Removing protection on a PDF you legally own is fine; doing it on a PDF you don't own may not be.