I have tested Adobe Acrobat Reader DC (v.2023.003.20269) on macOS Monterey 12.6.8 and Ventura 13.5. The identical PDF files were used on both instances of macOS.

What I discovered on both platforms is that the Adobe product may generate an empty text file, whether the PDF originated in Pages, LibreOffice Writer, or TeXShop. It may decide to split the text file into one word per line, or it may actually get the word extraction right, but with random concatenation of words.

Worse, on macOS Monterey, saving one PDF as text produced concatenated words mixed with individual words, each followed by a trailing space and a carriage return (^M). This particular PDF was generated by TeXShop and appeared normally in Acrobat Reader and Preview. Although one can remove the carriage returns, it does raise the question of what Adobe is doing injecting carriage returns on a UNIX machine where linefeeds are the norm.

So I got fed up with this nonsense and wrote a brief Swift script that just generates a correct text file regardless of the PDF it ingests. It reads one or more PDFs given on the command line and, for each, extracts the text to a file in the same location with the ".txt" extension. It worked perfectly with all of the PDFs I tested against Adobe's product, handling correctly the PDF documents where Adobe Acrobat Reader DC mangles the result: no concatenated words, and none of the other Adobe misdeeds.

Compiled: swiftc -Osize -o pdf2text pdf2text.swift -framework Foundation -framework AppKit -framework PDFKit

Tested: Ventura 13.5 (Swift 5.8.1), Monterey 12.6.8 (Swift 5.7.2)
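The script itself is not reproduced in this post, so what follows is only a minimal sketch of how a tool like this could be written with PDFKit, not the original code. The helper names `textURL(for:)` and `extractText(from:)` are made up for illustration, and the sketch assumes that PDFKit's `PDFDocument.string` property is enough to get the text out. It has not been tested against the PDFs described above.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

// Hypothetical helper: "/path/report.pdf" -> "/path/report.txt".
func textURL(for pdfURL: URL) -> URL {
    return pdfURL.deletingPathExtension().appendingPathExtension("txt")
}

// Hypothetical helper: returns the text of the whole document, or nil
// if the file cannot be opened as a PDF or has no extractable text.
func extractText(from pdfURL: URL) -> String? {
    #if canImport(PDFKit)
    return PDFDocument(url: pdfURL)?.string
    #else
    return nil // PDFKit is only available on Apple platforms
    #endif
}

// Process every PDF path given on the command line.
for path in CommandLine.arguments.dropFirst() {
    let pdfURL = URL(fileURLWithPath: path)
    guard let text = extractText(from: pdfURL) else {
        FileHandle.standardError.write(Data("Could not extract text from \(path)\n".utf8))
        continue
    }
    do {
        // Write the text next to the source PDF, as UTF-8.
        try text.write(to: textURL(for: pdfURL), atomically: true, encoding: .utf8)
    } catch {
        FileHandle.standardError.write(Data("Could not write text for \(path): \(error)\n".utf8))
    }
}
```

Saved as pdf2text.swift and built with the swiftc command above, a sketch like this would be run as ./pdf2text first.pdf second.pdf, leaving first.txt and second.txt beside the originals. A scanned PDF with no text layer yields no text here, since nothing in this sketch performs OCR.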