Text-searchable scanning (OCR)
Overview
Optical Character Recognition (OCR) is the process of taking an image, such as a scanned document, and reconstructing its text. This allows scanned documents to become searchable and/or editable.
Text-searchable documents have two major benefits over other scan outputs:
- You can search for specific content within the document.
- If the document has been added to a document management system, you can find the document by searching for its content.
Performing OCR is a resource-intensive process that can add seconds or tens of seconds per page to the time it takes to deliver a document. For this reason, enable OCR only on scan actions where it is most useful, and leave it disabled where fast delivery is more important.
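To get a feel for what "seconds or tens of seconds per page" means for a whole document, the following minimal sketch multiplies a page count by a low and a high per-page figure. The function name and the default values of 2 and 30 seconds are illustrative assumptions, not measured PaperCut MF figures; substitute timings observed in your own environment.

```python
def added_ocr_seconds(pages, low_per_page=2.0, high_per_page=30.0):
    """Return (low, high) estimates of the extra delivery time in seconds.

    The per-page defaults are placeholder assumptions, not product figures.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if low_per_page > high_per_page:
        raise ValueError("low_per_page must not exceed high_per_page")
    return pages * low_per_page, pages * high_per_page


low, high = added_ocr_seconds(20)
print(f"20 pages: roughly {low:.0f} to {high:.0f} extra seconds")
```

With these assumed figures, a 20-page scan could take anywhere from well under a minute to around ten minutes longer, which is why OCR is best reserved for scan actions that need searchable output.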
Currently, PaperCut MF supports the following text-searchable file types:
- PDF (text-searchable): PDF v1.5 with PDF/A-1B compliance according to the requirements defined by the PDF/A standard.
- DOCX
OCR processing in the cloud or on-premise
PaperCut MF provides the ability to run the OCR process using the PaperCut MF Cloud OCR service (one of PaperCut's Cloud Services) or using your on-premise infrastructure:
- PaperCut Cloud OCR service: Let the cloud do all the heavy lifting and benefit from:
  - improved local infrastructure performance
  - automatically deployed service updates, so you always have the latest performance improvements and functionality.
  The PaperCut MF Cloud OCR service processes concurrent jobs in parallel and handles any scaling of the service, even when there is a high user load.
- Locally hosted OCR (On-premise): For use when there's a requirement to host OCR on premise and you have a high-performing Application Server. Some organizations have a requirement for data to stay within their own infrastructure or even on their own premises, typically for regulatory or compliance reasons. Be aware that this involves installing the service on selected infrastructure and keeping it updated by installing new versions. For more information, take a look at the OCR FAQs, or to get started, see Set up locally hosted OCR (On-premise).
  IMPORTANT:
  - The locally hosted OCR (On-premise) solution requires the On-prem OCR & Document Processing Pack once the trial period is finished. For more information, contact your local Authorized Solution Center or reseller.
  - The locally hosted OCR (On-premise) solution is available only for Windows.
Supported languages
OCR supports extracting text for approximately 100 languages. You can choose up to 10 of those languages; however, for the best performance, limit your selection to a maximum of four.

A | F | L | S |
Afrikaans | Faroese | Lao | Sanskrit |
Albanian | Filipino | Latin | Scottish Gaelic |
Amharic | Finnish | Latvian | Serbian |
Arabic | Flemish | Letzeburgesch | Sindhi |
Armenian | Frankish | Lithuanian | Sinhala; Sinhalese |
Assamese | French | Luxembourgish | Slovak |
Azerbaijani | G | M | Slovenian |
B | Gaelic | Macedonian | Spanish |
Basque | Galician | Malay | Sundanese |
Belarusian | Georgian | Malayalam | Swahili |
Bengali | German | Maltese | Swedish |
Bosnian | Greek | Maldivian | Syriac |
Breton | Gujarati | Maori | T |
Bulgarian | H | Marathi | Tagalog |
Burmese | Haitian | Moldavian | Tajik |
C | Haitian Creole | Moldovan | Tamil |
Catalan | Hebrew | Mongolian | Tatar |
Cebuano | Hindi | N | Telugu |
Central Khmer | Hungarian | Nepali | Thai |
Cherokee | I | Northern Kurdish | Tibetan |
Chinese - Simplified | Icelandic | Norwegian | Tigrinya |
Chinese - Traditional | Indonesian | O | Tonga (Tonga Islands) |
Corsican | Inuktitut | Occitan (post 1500) | Turkish |
Croatian | Irish | Oriya | U |
Czech | Italian | P | Uighur |
D | J | Panjabi | Ukrainian |
Danish | Japanese | Pashto | Urdu |
Dhivehi | Javanese | Persian | Uyghur |
Divehi | K | Pilipino | Uzbek |
Dutch | Kannada | Polish | V |
Dzongkha | Kirghiz; Kyrgyz | Portuguese | Valencian |
E | Kazakh | Punjabi | Vietnamese |
English | Korean | Pushto | W |
Esperanto | Kurdish | Q | Welsh |
Estonian | | Quechua | Western Frisian |
| | R | Y |
| | Romanian | Yiddish |
| | Russian | Yoruba |
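The language limits described in this section (a hard limit of 10 and a recommended maximum of four) can be expressed as a simple pre-flight check. The sketch below is purely illustrative: the function name, the constants, and the short sample of supported languages are assumptions made for this example and are not part of any PaperCut MF API.

```python
MAX_LANGUAGES = 10          # hard limit stated in this section
RECOMMENDED_LANGUAGES = 4   # suggested ceiling for best performance

# Small sample taken from the table; a real check would use the full list.
SAMPLE_SUPPORTED = {"English", "French", "German", "Spanish", "Japanese", "Welsh"}


def check_language_selection(selected, supported=SAMPLE_SUPPORTED):
    """Return a list of warnings; raise ValueError if a stated limit is broken."""
    unique = list(dict.fromkeys(selected))  # drop duplicates, keep order
    unknown = [lang for lang in unique if lang not in supported]
    if unknown:
        raise ValueError("Unsupported languages: " + ", ".join(unknown))
    if len(unique) > MAX_LANGUAGES:
        raise ValueError(
            f"{len(unique)} languages selected; the maximum is {MAX_LANGUAGES}"
        )
    warnings = []
    if len(unique) > RECOMMENDED_LANGUAGES:
        warnings.append(
            f"{len(unique)} languages selected; {RECOMMENDED_LANGUAGES} or fewer "
            "is recommended for best performance"
        )
    return warnings


print(check_language_selection(["English", "French"]))
print(check_language_selection(["English", "French", "German", "Spanish", "Welsh"]))
```

The first call returns an empty list, while the second returns a single performance warning because five languages exceed the recommended four.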
FAQs

As far as the output is concerned, no.
Cloud OCR accepts scan data from around the world and processes it in the region chosen by the organization. This means that data might travel outside or be processed outside the country of origin.
Cloud OCR scales according to OCR job requirements, whereas locally hosted OCR (On-premise) requires you to manage local infrastructure and manually install server updates.

Yes, as long as your OCR server has access to the internet and is not behind a firewall.

If you have multiple OCR servers and one of them goes offline, any jobs in the queue for that server will be automatically transferred to a different OCR server for processing.

This will be available in a future release.

Yes. OCR scan jobs are identified in the XML metadata file by the file type and by a new element that specifies whether OCR is enabled. For more information, see Integration with Electronic Document Management Systems.
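A downstream system could use that metadata to route OCR jobs differently from plain scans. The following is a minimal sketch only: the element names `scanJob`, `fileType`, and `ocrEnabled` are hypothetical placeholders invented for this example, so check the actual metadata file produced by your installation for the real element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout; the real element names may differ.
SAMPLE_METADATA = """<scanJob>
  <fileType>PDF</fileType>
  <ocrEnabled>true</ocrEnabled>
</scanJob>"""


def is_ocr_job(xml_text, element_name="ocrEnabled"):
    """Return True if the (assumed) OCR element is present and set to 'true'."""
    root = ET.fromstring(xml_text)
    flag = root.findtext(f".//{element_name}", default="false")
    return flag.strip().lower() == "true"


print(is_ocr_job(SAMPLE_METADATA))
```

Treating a missing element as "not OCR" keeps the check safe for metadata files written by scan actions that do not use OCR.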

PaperCut MF supports thousands of MFD models with varying display panel sizes and resolutions, so the interface must cater for the smallest display panels. As a result, a maximum of three file type choices can be displayed at the MFD.