Sometimes a PDF uses an embedded font that does not define Unicode values for its characters, making it virtually impossible to convert the text. The PDF can still be displayed or printed because the font contains drawing commands (such as Bézier curves) that render each character (known as a “glyph”) as if it were a small graphical image. Technically, a “text stream” inside the PDF consists of “codes”, and these values are sent to the embedded font, which in turn looks up the commands needed to draw the glyphs. Most PDFs have either a specific font encoding for the codes or a “ToUnicode” table that can be used to look up and map codes to Unicode values. Unfortunately, some PDFs do not contain the additional information required to convert the codes to Unicode.
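The lookup described above can be illustrated with a minimal Python sketch. It is not how PDF2DTP or any PDF library is implemented; the table contents, the code values and the names `TO_UNICODE` and `decode_codes` are assumptions made up for illustration. It only shows why a code with no table entry cannot be turned into text.

```python
# Hypothetical ToUnicode-style table: character code -> Unicode text.
# Code 0x04 is deliberately missing, as it would be in a "problem" PDF.
TO_UNICODE = {0x01: "h", 0x02: "e", 0x03: "l"}


def decode_codes(codes, to_unicode):
    """Pair each code with its Unicode text, or None when no mapping exists."""
    return [(code, to_unicode.get(code)) for code in codes]


# Hypothetical text stream whose glyphs would draw the word "hello".
text_stream = [0x01, 0x02, 0x03, 0x03, 0x04]

decoded = decode_codes(text_stream, TO_UNICODE)

# Codes that can still be drawn as glyphs but cannot be converted to text.
unmapped = sorted({code for code, text in decoded if text is None})
```

Here the glyph for code 0x04 can still be drawn on screen, but the conversion has no way of knowing that it represents the letter “o”.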
An interesting test you can perform to confirm this problem is to open the “problem” PDF in an application such as Adobe Acrobat, then copy and paste the text into a text editor; you will see “garbage” characters displayed. This is because the characters are essentially raw codes from the text stream (which are either glyph IDs or indices into the glyph drawing tables). The only solution to convert (or extract) the text from such PDFs is to perform advanced OCR (Optical Character Recognition), but even then the resulting Unicode values may not be 100% accurate.
Consequently, when the Unicode values are unknown, PDF2DTP substitutes the characters with either a tilde “~” or a space (depending on your PDF2DTP Preferences setting). The tilde characters therefore act as “markers”: after the conversion is complete, you can use Find/Change to search for the tildes and edit them manually. Sometimes it is difficult to locate a tilde character, especially if it lies within overset text. One helpful tip is to select “Edit in Story Editor” (under the Edit menu), where you should be able to see its exact location in the story.
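The substitution and the later search for markers can be sketched in a few lines of Python. This is an illustrative sketch only, not PDF2DTP's own code; the function names and the `use_tilde` flag (standing in for the Preferences setting) are hypothetical.

```python
def substitute_unknown(decoded, use_tilde=True):
    """Join decoded characters, replacing unknown ones (None) with a placeholder."""
    placeholder = "~" if use_tilde else " "
    return "".join(ch if ch is not None else placeholder for ch in decoded)


def marker_positions(text, marker="~"):
    """Return the index of every marker character, like a Find/Change search."""
    return [index for index, ch in enumerate(text) if ch == marker]


# Hypothetical conversion result: two characters had no known Unicode value.
decoded = ["h", "e", None, None, "o"]

with_tildes = substitute_unknown(decoded)                   # "he~~o"
with_spaces = substitute_unknown(decoded, use_tilde=False)  # "he  o"
positions = marker_positions(with_tildes)                   # [2, 3]
```

The sketch also shows why the tilde is the more useful setting for later clean-up: a substituted space cannot be told apart from a real space, whereas tildes are rare enough in ordinary text to serve as search markers (a genuine tilde in the story would, of course, be found as well).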
However, PDF2DTP offers a solution that allows you to define the unknown characters, thus avoiding the tilde substitutions. This process uses the PDF2DTP Characters Editor.
An embedded font within a PDF is usually a subset of the original font and contains only those characters that are actually used in the text stories. For example, if the font is used for the word “hello”, then only four characters need to be defined: “h”, “e”, “l” and “o”. There are usually only a handful of characters that require editing, but sometimes there can be a fairly large number, and some PDFs can use dozens of embedded fonts that do not define Unicode values. While the task of editing all the characters may seem tedious, it pales in comparison to retyping entire stories. Besides, once you’ve edited the characters, the information is saved to disk so that it can be used for subsequent PDFs that reference the same font.
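Both ideas in this paragraph, the subset and the saved character definitions, can be sketched in Python. The JSON layout, the file name `characters.json`, the font name and the helper functions below are all assumptions for illustration; the actual format PDF2DTP uses on disk is not described here.

```python
import json
import tempfile
from pathlib import Path


def subset_characters(text):
    """Distinct characters a subset font must define to draw this text."""
    return sorted(set(text))


def load_mappings(path):
    """Read saved definitions (font name -> {code: character}), or {} if none."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_mapping(path, font_name, code, char):
    """Add one user-defined character for a font and write the file back."""
    mappings = load_mappings(path)
    mappings.setdefault(font_name, {})[str(code)] = char  # JSON keys are strings
    Path(path).write_text(json.dumps(mappings, indent=2), encoding="utf-8")
    return mappings


needed = subset_characters("hello")  # ['e', 'h', 'l', 'o']

# Hypothetical store location and font name, used only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    store = Path(folder) / "characters.json"
    save_mapping(store, "ABCDEF+ExampleFont", 4, "o")
    reloaded = load_mappings(store)  # as a later conversion would see it
```

Keying the saved definitions by font name is what would allow a later PDF that references the same font to reuse them without further editing.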
Therefore, the PDF2DTP Characters Editor allows you to achieve highly accurate text conversion, depending of course on how thorough and exact your editing is.