Camelot is a Python library that can help you extract tables from PDFs. PyPDF2 is not able to extract tables nicely, and tabula-py depends on Java.
Just like with any other Python library, the installation starts naively with a simple:
pip install camelot-py
Installation will download and install dependency libraries too, but once you run your sample code you will receive the following error message:
ModuleNotFoundError: No module named 'cv2'
Ouch! Looks like not all dependency libraries have been installed. Yep. The issue has been reported already and a workaround is suggested. With a sigh of relief, we proceed with:
pip install opencv-python
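If you'd rather see up front which imports are going to fail, here is a minimal sketch that only checks whether the modules can be found. The function name missing_modules is my own, and the module list is simply taken from the error messages in this post, so treat it as an assumption rather than Camelot's official dependency list.

```python
import importlib.util

# Import names taken from the error messages in this post (an assumption,
# not an official dependency list).
MODULES_TO_CHECK = ["camelot", "cv2", "PyPDF2", "ghostscript"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULES_TO_CHECK)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules found.")
```

Note that find_spec only locates a module and does not import it, so this will not catch errors that are raised while a module is being imported.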
Let's try to run the sample code again. Oops, a new error message (that means we are moving forward, after all!):
PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
Yep. That issue has been reported too, and we have a workaround:
pip install "PyPDF2<3.0"
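To confirm that the pin actually took effect, the installed version can be read with the standard library. This is a minimal sketch: the helper name pypdf2_is_pre_3 is hypothetical, and it assumes a plain dotted version string such as 2.12.1.

```python
from importlib import metadata


def pypdf2_is_pre_3(version_string):
    """Return True if the major version number is lower than 3."""
    major = int(version_string.split(".")[0])
    return major < 3


try:
    installed = metadata.version("PyPDF2")
except metadata.PackageNotFoundError:
    print("PyPDF2 is not installed")
else:
    if pypdf2_is_pre_3(installed):
        print("PyPDF2", installed, "still provides PdfFileReader")
    else:
        print("PyPDF2", installed, "has removed PdfFileReader")
```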
So let's get back and try to run our sample code. Again we have made progress and reached a new error message! This time it is:
OSError: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html
We follow the suggested URL and install Ghostscript. After trying to run the sample code once again, the following error message pops up:
ModuleNotFoundError: No module named 'ghostscript'
Let's install the ghostscript Python library with the following command:
pip install ghostscript
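The last two errors are about two different things: Ghostscript itself, which is a system program, and the ghostscript Python package that talks to it. A rough sketch to check both is below. The function names are my own, and looking for the executable on the PATH (gs on Linux and macOS, gswin64c or gswin32c on Windows) is only a proxy I am assuming here, since the Python package may look for the Ghostscript shared library instead.

```python
import importlib.util
import shutil
import sys


def ghostscript_executable():
    """Return the path of a Ghostscript executable on the PATH, or None."""
    if sys.platform == "win32":
        candidates = ["gswin64c", "gswin32c"]
    else:
        candidates = ["gs"]
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def python_binding_installed():
    """Return True if the 'ghostscript' Python module can be found."""
    return importlib.util.find_spec("ghostscript") is not None


print("Ghostscript executable:", ghostscript_executable() or "not found")
print("ghostscript Python module found:", python_binding_installed())
```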
Believe it or not, I just ran the sample code and for now it looks like that was it regarding the installation!
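For reference, the sample code is not shown in this post, so here is a minimal sketch of what such a test could look like. The function name extract_tables and the file name foo.pdf are placeholders of mine; camelot.read_pdf, the n property of the returned table list and the df attribute of each table come from the Camelot documentation.

```python
def extract_tables(pdf_path, pages="1"):
    """Return the tables found on the given pages as pandas DataFrames."""
    # Imported inside the function so this file can be loaded even when
    # Camelot is not installed yet.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [tables[i].df for i in range(tables.n)]


# Example usage (foo.pdf is a placeholder file name):
# for frame in extract_tables("foo.pdf"):
#     print(frame)
```

If any of the dependencies above is still missing, calling extract_tables is where the corresponding error would show up.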
P.S. Note to self: the last version of camelot-py that works with Python 2.7 is 0.7.3.