How to Extract Text from a PDF Free (Works on Scanned PDFs Too)

There are over 2.5 trillion PDFs in existence, and getting text out of them is harder than it should be. The right approach depends on one thing: whether your PDF has a text layer. This guide covers both cases — native PDFs with selectable text, and scanned image-based PDFs that require OCR — with free browser-based tools for each.

May 21, 2026 · 7 min read · Updated May 2026

Key Takeaways

  • Native PDFs (created from Word, web pages) have a text layer — extract instantly with Extract Text
  • Scanned PDFs are page images with no text — use the OCR tool, which runs Tesseract.js in your browser
  • Quick test: if you can't highlight text in your PDF reader, it's scanned
  • Tesseract.js hits over 95% accuracy on clean printed text (Tesseract project, 2024)
  • Both tools output plain .txt — no formatting recovery, no Word conversion

Most guides on extracting text from a PDF skip the part that actually matters: there are two completely different types of PDFs, and they require different tools. Run Extract Text on a scanned PDF and you'll get an empty file; run OCR on a native PDF and you'll wait longer for a less accurate result. Let's fix that.

What Are the Two Types of PDFs — and Which Do You Have?

There are two fundamentally different kinds of PDFs. Native PDFs contain an actual text layer — text your reader can index, search, and highlight. Scanned PDFs are photographs of documents: pages are stored as images with no text data at all. Over 400 million scanned documents are processed via OCR every year, which shows just how common this problem is. (Tesseract project statistics, 2024)

Native / Digital PDF

Has a text layer

Created directly from software — Word, Excel, Google Docs, a web browser, InDesign. The text exists as actual character data in the file.

  • You can select and copy text in any PDF reader
  • Ctrl+F search works on the content
  • File size is usually smaller than scanned equivalents
  • Text extraction is instant and lossless

Scanned / Image PDF

Pages are images

Created by scanning a paper document with a scanner or phone camera. Each page is stored as a JPEG or PNG image — no text layer exists.

  • You cannot select or copy any text
  • Ctrl+F finds nothing
  • File size is larger (images are heavy)
  • Requires OCR to extract text

How to tell which type you have: open the PDF in any reader (Adobe, your browser, Preview on Mac) and try to click and drag to highlight a word. If you can highlight text, it's a native PDF. If clicking produces nothing, or selects the entire page as one image block, it's scanned. You can also use FusionPDF's PDF reader to check — it shows whether a text layer is present.
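
The same highlight test can be automated once you have the text of each page from any extractor. The sketch below is only an illustration: `classify_pdf`, the `min_chars` threshold, and the "mixed" label for documents that combine digital and scanned pages are assumptions, not part of FusionPDF's tools.

```python
def classify_pdf(page_texts: list[str], min_chars: int = 20) -> str:
    """Classify a PDF from the text found on each page.

    page_texts holds one string per page, as returned by whatever text
    extractor you use. min_chars is an arbitrary cut-off that ignores
    stray characters such as a lone page number.
    """
    if not page_texts:
        raise ValueError("expected at least one page")

    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )

    if pages_with_text == len(page_texts):
        return "native"   # every page has a usable text layer
    if pages_with_text == 0:
        return "scanned"  # no page has a text layer, so OCR is needed
    return "mixed"        # some pages are digital, some are images


if __name__ == "__main__":
    print(classify_pdf(["Quarterly report for the board of directors.", "Revenue grew in every region this year."]))
    print(classify_pdf(["", "   "]))
```

A genuinely blank page in a native PDF would count as having no text here, so treat the result as a hint rather than a verdict.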

How to Extract Text from a Native PDF (3 Steps)

FusionPDF's Extract Text tool reads the text content stream directly from the PDF file. It's instant, lossless, and works on any native PDF, including ones with copy restrictions, because it reads the underlying data rather than the rendered display. The output is a plain .txt file containing all the text in content-stream order, which matches reading order on most simple layouts.

1. Open the Extract Text tool. Go to fusionpdf.pro/extract-text. No account or sign-up required.

2. Upload your PDF. Click "Select PDF" or drag your file onto the page. The file is loaded into your browser's memory using the FileReader API. Nothing is uploaded to any server.

3. Download the .txt file. Click "Extract Text". Processing runs locally: the tool parses the PDF's content stream and collects all text objects. Your browser downloads a .txt file containing the full extracted text, typically in under a second for most documents.

Limitation: layout is not preserved. The output is plain text. Column layouts, tables, headers, and multi-column formatting are not reconstructed — text flows in content-stream order, which may differ from reading order on complex layouts. If you need exact formatting recovery in Word format, see the honest note at the bottom.
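
To see why content-stream order and reading order can differ, here is a minimal sketch that reorders positioned text fragments after extraction. It is not something the Extract Text tool does. The `Fragment` type, the `in_reading_order` function, and the `line_tolerance` value are hypothetical, and the coordinates assume the usual PDF convention of an origin at the bottom-left of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position, growing to the right
    y: float   # vertical position, growing upwards (bottom-left origin)
    text: str


def in_reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> str:
    """Group fragments into lines by y position, then read left to right."""
    # Top of the page first (largest y), then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    lines: list[list[Fragment]] = []
    for frag in ordered:
        # Fragments whose baselines nearly coincide belong to the same line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


if __name__ == "__main__":
    # Content-stream order: the footer was drawn first, the title second.
    stream_order = [
        Fragment(72, 40, "Page 1"),
        Fragment(200, 700, "Report"),
        Fragment(72, 700.5, "Annual"),
        Fragment(72, 680, "Revenue rose."),
    ]
    print(in_reading_order(stream_order))
```

A sort like this still interleaves multi-column pages line by line, which is one reason column layouts are hard to reconstruct from plain text.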

Native PDF text extraction works by parsing the document's content stream, the internal sequence of drawing instructions that includes character codes and their positions. Because this is a read operation on structured data rather than a visual render, it's faster than OCR and perfectly accurate on any PDF with an embedded text layer, regardless of what display-level copy restrictions may be set.

Source: FusionPDF Extract Text implementation, based on the PDF specification (ISO 32000-2)
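
As a rough illustration of what "collecting text objects" means, the sketch below walks a content stream and gathers the strings passed to the PDF text-showing operators (`Tj`, `TJ`, `'` and `"`). It assumes the stream is already decompressed and that the strings use a simple one-byte encoding. Real PDFs usually compress streams and map character codes through font encodings, so this is a toy, not FusionPDF's implementation. The names `read_literal` and `shown_strings` are made up for the example.

```python
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}
SHOW_OPERATORS = {"Tj", "TJ", "'", '"'}
OPERATOR_CHARS = "'\"*"


def read_literal(stream: str, start: int) -> tuple[str, int]:
    """Read a literal string that starts at stream[start] == '('.

    Returns the decoded text and the index just past the closing ')'.
    Octal escapes such as \\101 are not handled in this sketch.
    """
    depth, i, out = 0, start, []
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            nxt = stream[i + 1]
            out.append(ESCAPES.get(nxt, nxt))  # \( \) \\ map to themselves
            i += 2
            continue
        if ch == "(":
            depth += 1
            if depth > 1:
                out.append(ch)  # balanced inner parentheses are literal
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    raise ValueError("unterminated literal string")


def shown_strings(stream: str) -> list[str]:
    """Return one string per text-showing operator, in stream order."""
    shown: list[str] = []
    pending: list[str] = []  # string operands waiting for their operator
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "(":
            text, i = read_literal(stream, i)
            pending.append(text)
        elif ch == "<":
            # Skip hex strings and dictionary markers; not decoded here.
            end = stream.find(">", i)
            i = len(stream) if end == -1 else end + 1
        elif ch.isalpha() or ch in OPERATOR_CHARS:
            j = i
            while j < len(stream) and (stream[j].isalnum() or stream[j] in OPERATOR_CHARS):
                j += 1
            if stream[i:j] in SHOW_OPERATORS:
                shown.append("".join(pending))
            pending = []  # operands belong to the operator that follows them
            i = j
        else:
            i += 1  # whitespace, numbers, brackets, name markers
    return shown


if __name__ == "__main__":
    sample = "BT\n/F1 12 Tf\n72 720 Td\n(Hello, ) Tj\n[(wor) -20 (ld)] TJ\nET\n"
    print(shown_strings(sample))  # ['Hello, ', 'world']
```

The numbers inside a `TJ` array are kerning adjustments, which is why the two halves of "world" are simply joined.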

How to Extract Text from a Scanned PDF with OCR (4 Steps)

For scanned PDFs, OCR (Optical Character Recognition) is the only path. FusionPDF's OCR tool uses Tesseract.js, the JavaScript port of Google's open-source Tesseract engine. Tesseract achieves over 95% character accuracy on clean printed text, according to the Tesseract project benchmarks. (Tesseract OCR project, 2024)

95%+ character accuracy on clean printed text. Tesseract.js, the OCR engine powering FusionPDF's OCR tool, reaches over 95% accuracy on standard printed documents. Accuracy drops on low-quality scans, dense handwriting, or unusual fonts.

1. Open the OCR tool. Go to fusionpdf.pro/ocr. No account or sign-up required.

2. Upload your scanned PDF. Click "Select PDF" or drag your file onto the page. Your file stays in the browser; nothing is sent to a server at any point.

3. Select the document language. Choose from English, French, Spanish, German, or Portuguese. This loads the matching Tesseract.js language model and meaningfully improves recognition accuracy, especially for accented characters and language-specific patterns.

4. Run OCR and download. Click "Run OCR". Each page is rendered as an image and passed through the Tesseract.js recognition pipeline. Processing time scales with page count: a 10-page document usually takes 30-60 seconds on a typical laptop. The result downloads as a .txt file when complete.
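
The 10-page figure works out to roughly 3 to 6 seconds per page, which gives a quick way to guess how long a larger job will run. The sketch below assumes processing time grows linearly with page count; `estimate_ocr_seconds` and its default range are illustrative, and real timings depend on your hardware and scan quality.

```python
def estimate_ocr_seconds(
    pages: int,
    seconds_per_page: tuple[float, float] = (3.0, 6.0),
) -> tuple[float, float]:
    """Return a (low, high) estimate of OCR time in seconds.

    The default range comes from "10 pages in 30-60 seconds" and assumes
    linear scaling, which is only an approximation.
    """
    if pages < 0:
        raise ValueError("pages must not be negative")
    low, high = seconds_per_page
    return pages * low, pages * high


if __name__ == "__main__":
    print(estimate_ocr_seconds(10))  # (30.0, 60.0)
    print(estimate_ocr_seconds(25))  # (75.0, 150.0)
```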

Supported languages

The OCR tool currently supports five languages: English, French, Spanish, German, and Portuguese. Select the primary language of your document before running OCR. If your document mixes languages, choose the dominant one. Multi-language detection in a single pass is not currently supported.
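
Behind the language menu, Tesseract identifies each language model by a three-letter code. The mapping below uses the standard Tesseract codes for the five supported languages. The function name and the error handling are assumptions for illustration, not FusionPDF's code.

```python
# Standard Tesseract language codes for the five languages listed above.
TESSERACT_LANGUAGE_CODES = {
    "english": "eng",
    "french": "fra",
    "spanish": "spa",
    "german": "deu",
    "portuguese": "por",
}


def tesseract_language_code(language: str) -> str:
    """Map a language name from the menu to its Tesseract model code."""
    key = language.strip().lower()
    try:
        return TESSERACT_LANGUAGE_CODES[key]
    except KeyError:
        supported = ", ".join(sorted(TESSERACT_LANGUAGE_CODES))
        raise ValueError(
            f"unsupported language {language!r}; choose one of: {supported}"
        ) from None


if __name__ == "__main__":
    print(tesseract_language_code("German"))  # deu
```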

Accuracy expectations

Tesseract.js performs best on clean, high-contrast scans of printed text at 300 DPI or above. Accuracy degrades on low-contrast scans (pencil, faded ink), unusual fonts, dense handwriting, or images taken on a phone at an angle. We've found that even a slight rotation of the page image can noticeably reduce recognition quality.
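
If you want to check accuracy on your own scans, one common way to measure it is character accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. This guide does not say how the 95% figure was measured, so treat the definition below as an assumption. Both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, from 0.0 to 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Two typical OCR confusions: I read as l, and 0 read as O.
    print(round(character_accuracy("Invoice 2024", "lnvoice 2O24"), 3))  # 0.833
```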

Tip for better OCR results: Scan at 300 DPI minimum if you have the option. If you're working from a phone photo, make sure the page is flat, well-lit, and the camera is positioned directly above (not at an angle). Higher-quality input consistently produces better output from any OCR engine.

Tesseract.js is a JavaScript and WebAssembly port of the Tesseract OCR engine, which was originally developed at HP Labs and later developed with Google's sponsorship. It runs entirely in the browser, with no server-side component. Over 400 million scanned documents are processed via Tesseract annually across all its implementations. Character accuracy exceeds 95% on clean printed text across supported languages.

Source: Tesseract OCR project (GitHub: tesseract-ocr/tesseract); Tesseract.js (GitHub: naptha/tesseract.js), 2024

Why Can't I Copy Text from My PDF?

There are two distinct reasons why text copying fails in a PDF reader. The first is the most common: your PDF is scanned, so there's no text to copy. The second is less common but genuinely frustrating: the PDF owner set copy-protection permissions that disable clipboard operations in the reader. These two problems have different solutions.

| Reason you can't copy | What it looks like | Solution |
| --- | --- | --- |
| Scanned PDF (no text layer) | Clicking selects the entire page as one image; Ctrl+F finds nothing | OCR tool: recognizes text from the image |
| Copy permissions disabled | Text highlights normally but Ctrl+C does nothing; "Copy" is greyed out | Extract Text tool: reads the content stream directly, bypassing display restrictions |

When a PDF has copy restrictions, the text layer is still present in the file. The restriction is a permission flag that asks PDF readers to disable clipboard operations in their interface; it does not remove the text or lock it behind a password that only the owner knows. Reading the content stream directly, as FusionPDF's Extract Text tool does, works around this display-level restriction without needing the owner password. Note that this applies only to copy-restricted PDFs. PDFs that require a password just to open are a different case.
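
In the PDF specification's standard security handler, these permissions are stored as bit flags in a single integer named P, and bit 5 is the one that controls copying. The sketch below decodes such a value. The bit positions follow the specification, while the function name, the dictionary keys, and the sample values are chosen for illustration.

```python
# Bit positions (1 = lowest bit) from the PDF standard security handler.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,  # copy or extract text and graphics
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode the signed integer P into named permission flags.

    Python's & operator treats negative integers as two's complement,
    so the signed value found in the file can be used directly.
    """
    return {
        name: bool(p & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    print(decode_permissions(-44)["copy"])    # True
    print(decode_permissions(-3904)["copy"])  # False
```

A reader that honours the flag greys out "Copy" when the bit is clear. The text data is the same either way, which is why a tool that reads the content stream can still collect it.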

If you're sharing a PDF and want to prevent others from extracting specific sections, a better approach than copy restrictions is actual text redaction. See FusionPDF's Redact tool for permanently removing sensitive content from a PDF's content stream.

What Can You Do with Extracted Text?

Plain text extracted from a PDF is more useful than it might first appear. Because it's unformatted and machine-readable, it can feed directly into tools and workflows that formatted documents can't. Here are the most common practical applications, each of which benefits from having clean text output.

  • Research and quotes. Pull specific passages from reports, papers, or legal documents for citation. Plain text is easier to search, copy selectively, and paste into notes or writing tools.
  • Feeding translation tools. Paste extracted text into DeepL, Google Translate, or a language model. Uploading formatted PDFs to translation tools often produces messy output. Plain text produces clean results.
  • Making scanned documents searchable. Extract text from a scanned PDF, save the .txt alongside the original PDF, and your search tool (or your own search index) can find content inside it.
  • Data extraction from reports. Pull tables or numerical data from PDFs for further analysis. While layout isn't preserved, the numbers and labels are all there in the text output.
  • Accessibility. Convert scanned documents to text so screen readers and assistive tools can process them. Scanned PDFs are completely inaccessible to screen readers without OCR.
  • Feeding AI tools. Language models work with plain text. Extracted PDF content can go directly into a prompt for summarization, classification, or question-answering tasks.
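
For the "searchable scans" use case in the list above, a small inverted index over the extracted .txt files is often enough. This is a minimal sketch under simple assumptions: the texts are passed in as a dictionary keyed by file name, matching is by whole lowercase words, and `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of files that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


if __name__ == "__main__":
    extracted = {
        "invoice.txt": "Invoice total: 420 EUR",
        "memo.txt": "Quarterly memo about the invoice backlog",
    }
    index = build_index(extracted)
    print(sorted(search(index, "invoice")))        # ['invoice.txt', 'memo.txt']
    print(sorted(search(index, "invoice total")))  # ['invoice.txt']
```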

2.5T PDFs estimated to exist worldwide. With over 2.5 trillion PDF files in existence, text extraction is one of the most frequent document operations across research, business, legal, and government workflows. (ISO PDF statistics)

If you're considering feeding extracted PDF content to an AI service and your documents contain sensitive data, read our PDF privacy guide for what to think about before sharing document content with third-party tools.

Honest Note: This Is Text Extraction, Not PDF-to-Word Conversion

FusionPDF's tools extract text — they don't reconstruct Word documents with formatting. If you need to recover a PDF back into a fully formatted .docx file with columns, tables, images, and fonts intact, that's a fundamentally different (and harder) problem. Here's what actually exists in that space, stated plainly.

Text extraction gives you:

  • All the words and numbers from the document
  • Basic reading-order flow of paragraphs
  • A plain .txt file you can edit in any text editor

Text extraction does not give you:

  • Tables reconstructed as tables
  • Multiple columns laid out correctly
  • Headers, fonts, or styling
  • Images from the original document
  • A .docx file you can open in Word

If you need full formatting recovery: Adobe Acrobat Pro's PDF-to-Word export is the most reliable paid option for layout reconstruction. It handles tables, columns, and images with reasonable accuracy on native PDFs. It isn't free, but it does the job it advertises. For simple text extraction (what most people actually need), the free tools on this page are sufficient.

In our experience, the majority of people asking "how do I convert PDF to Word" actually just want to copy the text content — which text extraction handles perfectly. Full layout recovery is only necessary when you need to edit the document while preserving its exact visual structure.

Frequently Asked Questions

What is the difference between Extract Text and OCR?

Extract Text reads the text layer that already exists inside a native PDF — it's instant and perfectly accurate. OCR (Optical Character Recognition) scans the page images of a scanned PDF and recognizes characters visually using Tesseract.js, achieving over 95% accuracy on clean printed documents. Use Extract Text for digital PDFs; use OCR for scanned ones. The quick test: can you highlight text in your PDF reader? If yes, use Extract Text. If not, use OCR.

Why can't I copy text from my PDF?

Two reasons are common. First, your PDF may be scanned: pages are images with no text layer, so there's nothing for the clipboard to copy. Use the OCR tool for this. Second, the PDF may have copy restrictions set by its creator, which tell PDF readers to disable clipboard operations. FusionPDF's Extract Text tool reads the content stream directly and bypasses these display-level restrictions, since the text data itself is still present in the file.

Does the OCR tool work on handwriting?

Tesseract.js is optimized for printed text and performs poorly on handwriting. Recognition accuracy typically drops to 40-60% on neat cursive, and lower on irregular handwriting. This is a fundamental limitation of how Tesseract's recognition models are trained. For reliable handwriting recognition, specialized tools with dedicated handwriting models (like Google Cloud Vision or Microsoft Azure Computer Vision) are more appropriate, though none of them are free for bulk use.

What languages does the OCR tool support?

FusionPDF's OCR tool currently supports five languages: English, French, Spanish, German, and Portuguese. Select the correct language before running OCR to load the matching Tesseract.js language model. The right language model improves accuracy on accented characters and language-specific letter combinations. If your document mixes two languages, choose the dominant one for best results.

Is there a page limit for text extraction or OCR?

There is no hard page limit on either tool. Extract Text handles large multi-page PDFs quickly since it reads the existing text layer directly. OCR processing time scales with page count because each page is rendered as an image and analyzed separately. At roughly 3 to 6 seconds per page, a 50-page scanned PDF may take about 2.5 to 5 minutes on an average laptop. The practical limit for both tools is your device's available RAM, not an artificial cap imposed by the service.

Extract Text from Your PDF — Free, No Upload

Two tools, any PDF. Native PDFs extract in seconds. Scanned PDFs run through Tesseract.js OCR in your browser — no file ever leaves your device.