How to Extract Text from a PDF


Copying text from a PDF can be surprisingly frustrating. Formatting breaks, columns get merged, and line breaks appear in the wrong places. A dedicated text extraction tool pulls the raw text content from the PDF structure, giving you clean plain text you can actually work with. A browser-based extractor handles the entire job locally without uploading your document to a server.

Text-based vs scanned PDFs

Before extracting text, it helps to understand what kind of PDF you have:

Text-based PDFs: created from Word documents, web pages, or other digital sources. The text is stored as data inside the PDF. You can select and highlight text when viewing these files. Text extraction works reliably with these.

Scanned PDFs: created by scanning a physical document. The PDF contains images of pages, not actual text data. You cannot select text in these files. Standard text extraction returns empty results; you need OCR software instead.

Hybrid PDFs: some PDFs contain a mix of digital text and scanned images. The extractor will capture the text portions but not the image-based content.

Searchable scanned PDFs: a scanned PDF that has been run through OCR, with the recognized text embedded as a layer behind the page images. Text extraction works on these because the OCR text is stored in the PDF. The accuracy depends on the OCR quality: OCR text often contains typos from misrecognized characters.
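The four categories above can be told apart with a simple per-page check. The following is a minimal sketch, not how any particular tool does it: `PageInfo`, `classify_pdf`, and the `min_chars` threshold are hypothetical names chosen for illustration, and the sketch assumes some extractor has already reported each page's embedded text and whether the page is mostly a scanned image.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page summary produced by some extractor."""
    text: str             # embedded text found on the page ("" if none)
    has_page_image: bool  # True if the page is dominated by a scanned image


def classify_pdf(pages: list[PageInfo], min_chars: int = 20) -> str:
    """Guess the PDF type from per-page text and image presence."""
    if not pages:
        return "empty"
    has_text = [len(p.text.strip()) >= min_chars for p in pages]
    has_image = [p.has_page_image for p in pages]

    if not any(has_text):
        # No usable text anywhere: standard extraction returns nothing.
        return "scanned" if any(has_image) else "empty"
    if all(t and i for t, i in zip(has_text, has_image)):
        # Every page is an image with a text layer behind it.
        return "searchable scanned"
    if any(has_image):
        # Some pages (or parts of pages) are images, others carry real text.
        return "hybrid"
    return "text-based"


print(classify_pdf([PageInfo("x" * 50, False), PageInfo("", True)]))
```

The threshold is arbitrary: a page with only a page number should probably not count as having text, but the right cutoff depends on the documents involved.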

How to extract text from a PDF

  1. Upload your PDF: select the file or drag and drop it. The tool accepts any standard PDF.
  2. Extract text: click the extract button. The tool processes all pages and displays the raw text.
  3. Copy or download: copy the text to your clipboard or download it as a TXT file.

A brief history of PDF text extraction

Adobe created PDF in 1993 with an internal structure built to reproduce page layout exactly, not to make text easy to reuse. A PDF stores text as positioned glyphs (a character plus its x/y coordinates on the page), not as flowing prose. To extract readable text, a tool has to read these glyph positions and reconstruct paragraphs by inferring word boundaries, line breaks, and reading order.
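To make the word-boundary problem concrete, here is a minimal sketch in Python. The `Glyph` class, the `join_line` function, and the `gap_ratio` threshold are assumptions made for this illustration, not part of any PDF library. The idea is that a space is inferred wherever the horizontal gap between two neighbouring glyphs is large compared with the glyph width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge on the page, in points
    y: float      # baseline position, in points
    width: float  # advance width, in points


def join_line(glyphs: list[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, inserting a space wherever the gap
    between neighbouring glyphs is wide relative to the previous glyph."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = []
    prev = None
    for g in ordered:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


line = [
    Glyph("H", 72.0, 700.0, 7.0),
    Glyph("i", 79.0, 700.0, 3.0),
    Glyph("y", 86.0, 700.0, 6.0),  # 4 pt gap after "i": treated as a space
    Glyph("o", 92.0, 700.0, 6.0),
    Glyph("u", 98.0, 700.0, 6.0),
]
print(join_line(line))  # Hi you
```

A fixed ratio like this is fragile with justified text or unusual fonts, which is one reason extracted text sometimes has missing or extra spaces.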

The first widely used PDF text extractor was pdftotext (1996), part of the open-source xpdf project by Derek Noonburg. It used a simple algorithm: sort glyphs by Y then X, group them into lines, then group lines into blocks. Most modern extractors still use a refined version of this approach.
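The sort-and-group idea can be sketched in a few lines of Python. This is an illustration of the general approach described above, not pdftotext's actual code. The function name `group_glyphs` and the `line_tol` and `block_gap` thresholds are hypothetical, glyphs are assumed to be `(char, x, y)` tuples with y increasing up the page, and spaces are assumed to be present as glyphs already.

```python
def group_glyphs(glyphs, line_tol=2.0, block_gap=20.0):
    """glyphs: iterable of (char, x, y) with y increasing up the page.

    Returns a list of blocks; each block is a list of line strings.
    """
    # 1. Sort top-to-bottom (descending y), then left-to-right.
    ordered = sorted(glyphs, key=lambda g: (-g[2], g[1]))

    # 2. Group glyphs whose y is within line_tol of the line's first glyph.
    lines = []  # list of (y, [glyphs])
    for g in ordered:
        if lines and abs(lines[-1][0] - g[2]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g[2], [g]))

    # 3. Start a new block when the vertical gap between lines is large.
    blocks = []
    prev_y = None
    for y, members in lines:
        # Glyphs in one line may differ slightly in y, so re-sort by x.
        text = "".join(ch for ch, _, _ in sorted(members, key=lambda g: g[1]))
        if prev_y is None or prev_y - y > block_gap:
            blocks.append([])
        blocks[-1].append(text)
        prev_y = y
    return blocks


sample = [("b", 10, 688), ("A", 0, 700), ("B", 10, 700),
          ("a", 0, 688), ("Z", 0, 640)]
print(group_glyphs(sample))  # [['AB', 'ab'], ['Z']]
```

Because the sort is purely geometric, two side-by-side columns at the same height end up on the same line, which is why multi-column layouts are hard for this kind of algorithm.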

PDF.js (Mozilla, 2011) brought PDF rendering to the browser without a plugin. It also exposed a text-extraction API that powers most browser-based extractors today, including this one. PDF.js handles the PDF features a browser viewer typically needs: text, images, forms, annotations, and embedded fonts.

The main improvements over the years have been:

Modern extraction works well for prose documents (books, articles, contracts). It still struggles with multi-column scientific papers, complex tables, and heavily formatted brochures.

When text extraction is useful

Output format options

Different uses need different output formats:

Format              | Best for                              | Limitations
Plain text (.txt)   | Universal, no formatting              | Loses headings, lists, tables
Markdown (.md)      | Structured docs, headings preserved   | Tables may need manual fix
HTML                | Web display, preserves bold/italic    | More complex than .txt
Word (.docx)        | Editing in Microsoft Word             | Loses some PDF-specific formatting
JSON                | Per-page or per-block extraction      | For developers, not direct reading
XML/EPUB            | E-book conversion                     | Complex setup

For most everyday extraction (copying a paragraph, feeding text to an LLM), plain text is the right choice. For long documents you intend to re-edit, PDF to Word is usually better.
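As an illustration of the difference between the plain-text and JSON rows in the table, here is a small Python sketch. It assumes the extracted text is already available as a list with one string per page. The function names and the JSON field names (`page`, `text`) are choices made for this example, not a standard format.

```python
import json


def to_plain_text(pages: list[str]) -> str:
    """Join page texts into one string, with a blank line between pages."""
    return "\n\n".join(pages)


def to_json(pages: list[str]) -> str:
    """Keep page boundaries by emitting one record per page (1-based)."""
    records = [{"page": i, "text": t} for i, t in enumerate(pages, start=1)]
    return json.dumps(records, ensure_ascii=False, indent=2)


pages = ["First page text.", "Second page text."]
print(to_plain_text(pages))
print(to_json(pages))
```

The plain-text form is what you would paste into an editor or an LLM prompt. The JSON form keeps the page numbers, which helps when a later step needs to cite where a passage came from.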

Common pitfalls

Alternative approaches

If browser-based extraction does not work for your PDF:

For confidential PDFs that should not leave your machine, browser-based extraction (this tool) or local command-line tools (pdftotext) are the only safe options.

Tips

Privacy and confidential PDFs

The PDF text extractor runs entirely in your browser. The PDF you upload, intermediate processing, and the extracted text all stay on your device. Nothing is uploaded to a server, logged, or shared with anyone.

This matters because PDFs you extract text from are often very sensitive: contracts with embedded clauses you need to quote, medical records and lab reports, financial statements with account numbers, legal pleadings under attorney-client privilege, employment offer letters and salary details, internal corporate documents, research papers under embargo before publication, scanned IDs and passports, immigration documents. Cloud PDF extractors by design upload your files to their servers, often retain them for "service improvement," and have been involved in real data leaks where confidential contracts and medical records ended up indexed by search engines. A browser-based extractor has zero exposure: the PDF never leaves your machine.

Browser-based extraction also works offline once the page is loaded. This is useful for processing documents on airplanes, in secure facilities without internet access, or anywhere you cannot or should not upload a confidential document to a third party.

Frequently Asked Questions

Why did my PDF extraction return empty results?

The PDF is likely a scanned document: it contains images of text, not actual text data. Text extraction only works with PDFs that have embedded, selectable text. For scanned documents, you need OCR (optical character recognition) software.

Does this tool use OCR?

No. It extracts embedded text directly from the PDF structure. This is faster and more accurate than OCR for text-based PDFs, but it cannot read text from scanned images.

Is my PDF uploaded to a server?

No. All processing happens in your browser. Your PDF never leaves your device, making it safe for confidential documents.

Can I extract text from a specific page?

The tool processes all pages and returns the complete text. You can then copy or edit the specific sections you need from the output.
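If you have the output split by page (for example from a command-line extractor or your own script), picking out specific pages is straightforward. The following Python sketch is an illustration only and is not a feature of this tool: `parse_page_spec` and `select_pages` are hypothetical helpers, and the `"1-3,5"` style of page specification is an assumed convention.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Turn a spec such as "1-3,5" into sorted 1-based page numbers."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        selected.update(range(start, end + 1))
    return sorted(selected)


def select_pages(pages: list[str], spec: str) -> list[str]:
    """Return the text of the requested pages, in page order."""
    return [pages[n - 1] for n in parse_page_spec(spec, len(pages))]


pages = ["page one", "page two", "page three", "page four"]
print(select_pages(pages, "2,4"))  # ['page two', 'page four']
```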