Ask HN: How to OCR a PDF and preserve whitespace?
3 by GirkovArpa | 1 comments on Hacker News.
I have some rather large PDFs that need to be transcribed, but every service I try has some minor but deal-breaking flaw. They either don't support PDFs this large (hundreds of pages), are just really bad at English OCR, or, most commonly, don't preserve whitespace correctly. The number one problem is whitespace in multi-column layouts (similar to newspapers): the output either has no spaces between words or, when there are multiple columns of text, puts the rows in the wrong order. If it were just a single page, the output would still be useful, since I could fix it myself. But I have over 1000 pages. I tried so many free services and trials that I got charged for forgetting to cancel one (thanks to smallpdf.com for refunding my $12). Is OCR technology just not there yet when it comes to multi-column pages? It doesn't seem to be an issue for newspapers.com, though, based on my experience using their text search feature. I would like to know what OCR software they are using.