Installing Tabula on Windows:
Getting Started
Requirements
- Java: 8+
- Python: 3.5+
Installation
Before installing tabula-py, make sure a Java runtime is installed in your environment.
You can install tabula-py from PyPI with the pip command:
pip install tabula-py
Note
The conda recipe on conda-forge is not maintained by us. We recommend installing via pip
to get the latest version of tabula-py.
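To confirm that the pip install worked, you can ask Python which version of the package it sees. This is a minimal sketch, not part of tabula-py itself: the helper name installed_version is made up for illustration, and it assumes Python 3.8 or newer, where importlib.metadata is in the standard library.

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "tabula-py" is the distribution name used in the pip command above.
version = installed_version("tabula-py")
if version is None:
    print("tabula-py is not installed in this environment")
else:
    print("tabula-py version:", version)
```

If this reports that the package is missing even though pip succeeded, pip probably installed it into a different Python environment than the one running the script.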
Get tabula-py working (Windows 10)
These instructions were originally written by @lahoffm. Thanks!
- If you don't have it already, install Java.
- Try to run the example code (replace the PDF file name with your own).
- If there's a FileNotFoundError when it calls read_pdf(), and typing java on the command line says 'java' is not recognized as an internal or external command, operable program or batch file, you should set the PATH environment variable to point to the Java directory.
- Find the main Java folder, like jre... or jdk.... On Windows 10 it was under C:\Program Files\Java
- On Windows 10: Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables -> Select PATH -> Edit
- Add the bin folder, like C:\Program Files\Java\jre1.8.0_144\bin, and hit OK a bunch of times.
- On the command line, java should now print a list of options, and tabula.read_pdf() should run.
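The PATH check in the steps above can also be done from Python before calling tabula. This is only a sketch using the standard library: find_executable is a hypothetical helper name, and shutil.which simply searches the same PATH that the command line uses.

```python
import shutil


def find_executable(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


java_path = find_executable("java")
if java_path is None:
    # Same situation as the "'java' is not recognized..." message on Windows.
    print("java was not found on PATH; add the Java bin folder to PATH first.")
else:
    print("Found java at:", java_path)
```

After editing PATH, open a new command prompt or restart your Python session; processes that are already running keep the old value.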
Okay, let's start working with tabula-py:
import tabula
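# Read a PDF into a DataFrame (recent tabula-py versions return a list of DataFrames)
# pages="all" is one example option; by default only page 1 is read
df = tabula.read_pdf("test.pdf", pages="all")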
# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv')
# convert only selected pages of a multi-page PDF
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages=(1,2,3))
# After converting to CSV, you can load the extracted file as a pandas DataFrame.
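Loading the converted file back into pandas could look roughly like this. It is a sketch under assumptions: load_extracted_csv is a made-up helper name, pandas is assumed to be installed (tabula-py depends on it), and the first row of the CSV is assumed to hold the column names, which will not be true for every extracted table.

```python
import pandas as pd


def load_extracted_csv(csv_path):
    """Load a CSV file, such as one written by tabula.convert_into(), into a DataFrame."""
    # Assumption: the first row contains the column names.
    # Pass header=None to pd.read_csv instead if the table has no header row.
    return pd.read_csv(csv_path)
```

For example, load_extracted_csv("output.csv").head() shows the first few extracted rows, assuming output.csv was produced by the convert_into call above.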