this post was submitted on 13 Aug 2024
39 points (100.0% liked)

Open Source


I have received a lot of PDF documents that I wish to convert to text formats such as docx/doc/odt.

I know there are some online tools that will do it for you, but some content may be sensitive with people's names and addresses and I'm not sure I can trust these websites.

Is there any software that can convert a PDF to odt?

Things I know and tried:

  1. Asked a friend to open the PDF in Microsoft Word: their license expired last month, so it won't let them save the file!

  2. Tried to do the same in LibreOffice Writer: it doesn't support that format.

  3. Tried to open it in LibreOffice Draw: untenable, as I want to type more things in the document.

P.S.: I use Linux, but I reckon solutions for other platforms would be fine too.

top 14 comments
[–] Walking_coffin@lemmy.dbzer0.com 13 points 1 month ago* (last edited 1 month ago) (1 children)

If the PDF files are properly formatted (no compression, all text selectable), you should be able to open a terminal and run the following (I know it works the other way around; I'm not sure whether LibreOffice can actually do the reverse, but it doesn't hurt to try):

libreoffice --headless --convert-to docx *.pdf

Just know that since docx is a proprietary format by Microsoft, the results may be flawed. As a last resort, I guess you could run a Windows VM and try converting your files with any of the big programs known to handle such files.
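
In case the plain command refuses the PDFs, here is a minimal sketch of a variant. As far as I know, LibreOffice opens PDFs in Draw by default, and passing `--infilter=writer_pdf_import` asks it to use the Writer import filter instead, which is what lets `--convert-to odt` work. The function name `convert_pdf`, the `converted` output folder, and the `DRY_RUN` switch are my own placeholders, and the binary may be called `soffice` rather than `libreoffice` on your system.

```shell
# Sketch: convert one PDF to ODT via LibreOffice's Writer PDF import filter.
# Assumptions: the binary is named "libreoffice"; "converted" is a made-up
# default output folder; DRY_RUN is a hypothetical switch for this sketch.
convert_pdf() {
  pdf=$1
  outdir=${2:-converted}
  # With DRY_RUN set, the command is printed instead of executed.
  ${DRY_RUN:+echo} libreoffice --headless --infilter=writer_pdf_import \
    --convert-to odt --outdir "$outdir" "$pdf"
}

# Usage (run from the folder that holds the PDFs):
#   for f in *.pdf; do convert_pdf "$f"; done
```

Everything runs locally, so the names and addresses never leave your machine. Expect the layout to need manual cleanup afterwards.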

[–] gedaliyah@lemmy.world 11 points 1 month ago (1 children)
[–] Walking_coffin@lemmy.dbzer0.com 4 points 1 month ago

Thanks for the information. I wasn't aware of that.

[–] INeedMana@lemmy.world 11 points 1 month ago* (last edited 1 month ago)

I think it depends on what exactly is in the PDF. If it is text, you can in a pinch just copy and paste it, though I'd expect LibreOffice to be able to open it. If it is images, you'll have to use some OCR.
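
A rough way to tell the two cases apart from a terminal, assuming Poppler's `pdftotext` and the separate `ocrmypdf` tool are installed (neither was mentioned by the OP, so treat both as my suggestion). The function names and the `PDFTOTEXT` and `DRY_RUN` variables are hypothetical helpers for this sketch only.

```shell
# Sketch: decide whether a PDF already has a text layer or needs OCR.
# Assumptions: pdftotext (Poppler) and ocrmypdf are installed; PDFTOTEXT and
# DRY_RUN are made-up override switches, not options of those tools.

# Succeeds if the extracted text contains at least one letter or digit.
has_text_layer() {
  "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Runs OCR only when no text could be extracted; writes NAME.ocr.pdf.
ocr_if_needed() {
  if has_text_layer "$1"; then
    echo "$1: text layer found, no OCR needed"
  else
    ${DRY_RUN:+echo} ocrmypdf "$1" "${1%.pdf}.ocr.pdf"
  fi
}

# Usage:
#   for f in *.pdf; do ocr_if_needed "$f"; done
```

The OCR output is still a PDF, just with selectable text, so it would then go through whichever PDF-to-odt route you pick.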

[–] Max_P@lemmy.max-p.me 5 points 1 month ago

PDFs are inherently not designed to be edited; the format lacks a lot of the information necessary for layouts to work correctly and as expected.

That's why you have to open it with LibreOffice Draw, and the mess you see is basically all the information the PDF contains: a bunch of text cells placed at arbitrary positions on the page. That makes it really difficult to get back a sensible, editable version. Page wraps and such will never work correctly. Your only chance at recovering it is to figure out what software wrote it and how its different constructs end up when translated to PDF, plus a lot of heuristics.

I believe they open a bit better in Xournal++ but it still sucks.

Those who do build such tools realize it's mostly big companies with big budgets that have a serious need for this, so the tools tend to be proprietary and expensive, and still not super great.

I would really beg for the files to be provided in a suitable format for editing.

[–] Dotdev@programming.dev 4 points 1 month ago (2 children)
[–] Maroon@lemmy.world 1 points 1 month ago (1 children)

I have tried it a few times in the past to convert LaTeX to odt. It didn't work very well for me, and the workflow isn't very extensible when working with multiple documents (at least in my very limited experience).

Maybe it has become better now??

[–] Dotdev@programming.dev 1 points 1 month ago

It works fine; the text just won't keep the same formatting, but the text itself should be fine.

[–] halm@leminal.space 1 points 1 month ago (1 children)

I think Pandoc only converts to PDF? Maybe Poppler will do the trick.

[–] Dotdev@programming.dev 1 points 1 month ago (1 children)

It can convert to other formats, but it requires extra dependencies to work fully.

[–] halm@leminal.space 1 points 1 month ago

Well yes, pandoc converts between all sorts of files, but AFAIR it's not great at converting FROM PDF.
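
For what it's worth, a sketch of how the two could be chained: as far as I know pandoc has no PDF reader at all, so Poppler's `pdftotext` pulls the text out first and pandoc only builds the odt. Reading the extracted text as Markdown is my own workaround (stray `*` or `#` characters may get interpreted as formatting), and `pdf_to_odt_via_text` and `DRY_RUN` are made-up names for illustration.

```shell
# Sketch: PDF -> plain text (Poppler) -> ODT (pandoc).
# Assumptions: pdftotext and pandoc are installed; the extracted text is fed
# to pandoc's Markdown reader; DRY_RUN is a hypothetical switch that prints
# the commands instead of running them.
pdf_to_odt_via_text() {
  pdf=$1
  base=${pdf%.pdf}
  ${DRY_RUN:+echo} pdftotext "$pdf" "$base.txt" &&
    ${DRY_RUN:+echo} pandoc --from markdown --to odt --output "$base.odt" "$base.txt"
}

# Usage:
#   for f in *.pdf; do pdf_to_odt_via_text "$f"; done
```

This keeps the words but drops essentially all of the layout, which matches what was said above about the text not staying in the same format.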

[–] jjlinux@lemmy.ml 4 points 1 month ago

I use self-hosted Stirling-PDF.

[–] trex@anonsys.net 1 points 1 month ago

@Maroon
Maybe the link from the LibreOffice 24.2 Help will help you:

Tables of filter names for document conversion via the command line.

[–] monobot@lemmy.ml -1 points 1 month ago

Short answer: No.