Converting .pages to .odt (with a little help from AI)
The issue
The co-advisor for my MA thesis passed away last year and his widow allowed me to copy some unfinished drafts and documents from his computer for posterity. He was an avid Mac user, but every document I ever received from him was a .docx file. So, to my surprise, the majority of the files that I gleaned from his desktop were actually Apple Pages (.pages) files. Some were already converted into .docx and some were even rendered as PDF, but I wanted to see if they could be converted to .odt with something like pandoc. Alas, pandoc doesn’t handle .pages files, but surprisingly LibreOffice (my office suite of choice) does! I was already familiar with how to convert documents on the command line:
libreoffice --convert-to odt foo.docx
But my professor had Pages files in all sorts of directories. Some had descriptive names, others did not. Another issue I found was that LibreOffice can only open and convert .pages files that are, underneath, a ZIP archive. Some of the .pages files were directories. So, not being a very good bash scripter and not knowing where to begin, I did my most “2024 thing” yet: I asked ChatGPT for help.
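A minimal sketch of how that distinction can be checked from the shell. It assumes that a flat .pages file is an ordinary ZIP archive, which starts with the two bytes PK. The function name pages_kind is made up for illustration and is not taken from the finished script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a .pages item.
# Prints "dir" for a directory package, "zip" for a flat ZIP archive,
# and "other" for anything else (missing file, unknown format).
pages_kind() {
    local item
    item=$1
    if [ -d "$item" ]; then
        echo dir
    elif [ -f "$item" ] && [ "$(head -c 2 < "$item")" = "PK" ]; then
        # ZIP archives begin with the signature bytes "PK".
        echo zip
    else
        echo other
    fi
}
```

Calling pages_kind on each item lets a script decide whether it can be handed to LibreOffice directly or has to be zipped first.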
Hiring ChatGPT for the job
ChatGPT comes with its own ethical issues that I won’t go into here. ChatGPT is also sometimes inaccurate, which is a problem I faced when asking it to help code. I gave it a basic rundown at first:
I would like to code an apple pages –> LibreOffice Writer .odt converter in Linux bash script. Here’s the issue.
- Some .pages documents are zip files. These can be opened and converted with libreoffice --convert-to odt $INPUT_FILE with no problem.
- Some .pages documents are actually directories. In order to convert them with LibreOffice, we need to take out the contents of that directory, and zip them as a zip file with a .pages extension.
Here’s what I’d like the script to do:
- Go through all of the files in a given directory ending in .pages and determine whether the particular .pages item is a zip file or a directory. If a ZIP file, make a temporary copy in a working directory to be converted. If a directory, zip the CONTENTS of that directory into a ZIP folder with the same name of that directory and then move it into the aforementioned working directory.
- use the libreoffice --convert-to odt command on all of the newly moved .pages ZIP files in the working directory
- Move all of the newly-converted .odt files from the previous command into the original directory.
Each step should have echo commands telling the user which files are being worked on. To save time and maximize resources, implementing those commands with GNU Parallel would be a good decision, I think. Could you help me implement this, please?
ChatGPT was more than willing to help! It generated a complete script in which the user would input the source directory and a working directory, and the script would then take the necessary steps to convert the files. That wasn’t really good enough for me, though.
I asked ChatGPT to set /tmp/pages_to_odt/ as the default working directory (the user could always change it in the script and it’s going to be cleared, anyway). ChatGPT implemented it.
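To give a sense of the shape of that first, non-recursive version, here is a minimal sketch. It is an illustration written for this post, not the script ChatGPT produced. The function name convert_flat and the WORK_DIR variable are hypothetical, the --headless and --outdir options are my additions to the command quoted earlier, the zip utility is assumed to be installed, and GNU Parallel is left out to keep it short.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a non-recursive pass; not the finished script.
# WORK_DIR must be an absolute path because the zip step changes directory.
WORK_DIR=${WORK_DIR:-/tmp/pages_to_odt}

convert_flat() {
    local src_dir item name
    src_dir=${1%/}
    # Start from an empty working directory.
    rm -rf -- "$WORK_DIR" && mkdir -p -- "$WORK_DIR" || return 1

    # Step 1: stage every *.pages item (the glob skips hidden names).
    for item in "$src_dir"/*.pages; do
        [ -e "$item" ] || continue          # the glob matched nothing
        name=$(basename -- "$item")
        if [ -d "$item" ]; then
            echo "Zipping directory package: $name"
            # Zip the CONTENTS of the directory, not the directory itself.
            ( cd -- "$item" && zip -qr "$WORK_DIR/$name" . ) || continue
        else
            echo "Copying archive: $name"
            cp -- "$item" "$WORK_DIR/$name" || continue
        fi
    done

    # Step 2: convert everything that was staged.
    for item in "$WORK_DIR"/*.pages; do
        [ -e "$item" ] || continue
        echo "Converting: $(basename -- "$item")"
        libreoffice --headless --convert-to odt --outdir "$WORK_DIR" "$item"
    done

    # Step 3: move the results back to the source directory.
    for item in "$WORK_DIR"/*.odt; do
        [ -e "$item" ] || continue
        echo "Moving back: $(basename -- "$item")"
        mv -- "$item" "$src_dir/"
    done
}
```

It would be called as convert_flat /path/to/documents. Running the conversions one at a time is slower than GNU Parallel, but it avoids having several LibreOffice instances compete for the same user profile.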
The trouble came when I asked ChatGPT to implement the -R flag for recursive searching. The script in its first form only looked for and converted the .pages files in the current directory. To save time, I wanted to give the option to search in all sub-directories. ChatGPT obliged and changed the script accordingly, or so I thought.
The script was actually treating all files in the source directory and sub-directories as .pages files, copying them to the working directory, and then adding the .pages extension to them. Not only was this eating up tons of space on my drive, but it was also wasting processing resources, since LibreOffice would obviously not be able to convert image files, PDFs, and whatever else was there. So, I pumped the brakes and gently reminded ChatGPT that the files had to be .pages files ALREADY ending in the .pages extension. It actually took two more iterations for ChatGPT to understand what I was getting at and to squash the bug.
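The filter I was asking for can be sketched with find. This is an illustration under my own assumptions rather than the code from the finished script: the function name find_pages and its "recursive" argument are made up here to stand in for the -R flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only items whose name already ends in .pages.
# $1 is the directory to search; pass "recursive" as $2 to include
# sub-directories, otherwise only the top level is searched.
find_pages() {
    local root mode
    root=$1
    mode=${2:-flat}
    if [ "$mode" = recursive ]; then
        # -prune stops find from descending into .pages directory packages.
        find "$root" -name '*.pages' -print -prune
    else
        find "$root" -maxdepth 1 -name '*.pages' -print -prune
    fi
}
```

Because the match is on the name, images, PDFs, and everything else in the tree are never copied or renamed in the first place.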
At that point, I said, “we’re 90% of the way there!” The only problem was that files were not being moved back to their respective original locations. They were all being moved back to the initial source directory. Not only that, but hidden .pages files (those with a leading .) were added to the batch and they were often incomplete, so the script was throwing “unable to convert .foo.pages” errors, which gave me the false impression that something was wrong with the script itself.
The file indexing and moving saga is pretty complex and I have already taken up a lot of space here. Suffice it to say that it took about 3 more iterations of the script to get the file moving functionality to work.
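For the curious, the core idea can be sketched in a few lines: skip hidden items, convert each file in its own scratch directory, and derive the destination from the source path. This is a simplified illustration and not the logic from the finished script. The function names is_hidden, odt_target, and convert_beside_source are hypothetical, and the sketch only handles .pages files that are already ZIP archives.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; a simplified sketch, not the finished script.

# Succeeds if the item's own name starts with a dot.
is_hidden() {
    case $(basename -- "$1") in
        .*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Prints where the converted file belongs: same directory, .odt extension.
odt_target() {
    local src
    src=${1%/}
    printf '%s\n' "${src%.pages}.odt"
}

# Converts one archive-type .pages file in a private scratch directory,
# then moves the result next to the original file.
convert_beside_source() {
    local src name stage status
    src=${1%/}
    status=0
    if is_hidden "$src"; then
        echo "Skipping hidden item: $src"
        return 0
    fi
    name=$(basename -- "$src")
    stage=$(mktemp -d) || return 1
    echo "Converting: $src"
    libreoffice --headless --convert-to odt --outdir "$stage" "$src" > /dev/null 2>&1
    if [ -f "$stage/${name%.pages}.odt" ]; then
        mv -- "$stage/${name%.pages}.odt" "$(odt_target "$src")"
    else
        echo "No output produced for: $src" >&2
        status=1
    fi
    rm -rf -- "$stage"
    return "$status"
}
```

Giving each file its own scratch directory also means that two documents with the same name in different folders cannot overwrite each other on the way back.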
However… ChatGPT kept forgetting about the .pages DIRECTORIES compared to the .pages ZIP files. The ZIP files were converting just fine, but two or three times ChatGPT completely forgot to keep the directory-to-zip conversion in the script. So, I had to remind it each time. Sometimes when ChatGPT put it back in, it changed how it implemented it. To be honest, I was too chicken to just copy and paste it in myself because, although bash does not care about indentation, it is easy to paste a block inside the wrong loop or if statement and break the script!
In the end, and maybe 2 hours into the process, I got a working script, which you can find on my GitHub. I now want to talk about the results of it and what I learned about using ChatGPT.
Results and ChatGPT usage
In the end, most of the .pages files converted properly. The others failed not because of my script, but because LibreOffice apparently only recognizes and opens .pages files that have an underlying XML structure. The .pages files that did not convert have .iwa files in their archives and it seems LibreOffice cannot open those. My current thought is to borrow somebody’s Mac (or perhaps spin up a Mac VM) and manually convert each of those in Pages.
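A quick way to spot those files ahead of time is to list what is inside each .pages item and look for names ending in .iwa. The sketch below is an assumption-laden illustration: the function names list_members and uses_iwa are made up, the unzip utility is assumed to be installed for the archive case, and it only looks at top-level member names, so .iwa files tucked inside a nested archive would be missed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spotting .pages items that contain .iwa files.

# Prints the member names of a .pages item, one per line.
list_members() {
    if [ -d "$1" ]; then
        # Directory package: walk the files on disk.
        ( cd -- "$1" && find . -type f )
    else
        # Flat archive: ask unzip (zipinfo mode) for the names only.
        unzip -Z1 "$1" 2> /dev/null
    fi
}

# Succeeds if any member name ends in .iwa.
uses_iwa() {
    list_members "$1" | grep -q '\.iwa$'
}
```

For example, uses_iwa foo.pages && echo "needs Pages on a Mac" would flag a file before any time is spent trying to convert it.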
Now, onto ChatGPT thoughts… Honestly, if I were a better programmer, I probably would have spent less time scripting this than I did going back and forth with ChatGPT. Its code spat out so many errors and had to be corrected so many times that I honestly do not feel like I was cheating by using ChatGPT. If I didn’t know how to debug and didn’t know the proper way to describe what I wanted, ChatGPT would have NEVER given me a working script. I feel like I was the supervisor and ChatGPT was my minion. I truly feel that I was the designer of this script, even though I did not know how to write every step.
But, for those purists who disagree with me, my GitHub repo is public and I welcome any pull requests from actual humans to help improve the script and make it more human-made. I would especially like to know whether anyone knows a way to convert the .iwa-based .pages files into something LibreOffice could convert!
All in all, I am happy that I was able to use a tool whose creation I directed to break my late professor’s papers out of Apple’s walled garden and to save them for posterity in the OpenDocument format.