Installing Tabula on Windows:
Getting Started
Requirements
- Java 8+
- Python 3.5+
Installation
Before installing tabula-py, ensure that a Java runtime is installed in your environment.
You can install tabula-py from PyPI with the pip command:
pip install tabula-py
Note
The conda recipe on conda-forge is not maintained by us. We recommend installing via pip to get the latest version of tabula-py.
Get tabula-py working (Windows 10)
These instructions were originally written by @lahoffm. Thanks!
- If you don’t have it already, install Java.
- Try to run the example code (replace the PDF file name as appropriate).
- If there’s a FileNotFoundError when it calls read_pdf(), and when you type java on the command line it says 'java' is not recognized as an internal or external command, operable program or batch file, you should set the PATH environment variable to point to the Java directory.
- Find the main Java folder, like jre... or jdk.... On Windows 10 it was under C:\Program Files\Java
- On Windows 10: Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables -> Select PATH -> Edit
- Add the bin folder, like C:\Program Files\Java\jre1.8.0_144\bin, and hit OK a bunch of times.
- On the command line, java should now print a list of options, and tabula.read_pdf() should run.
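Before retrying read_pdf(), the PATH check from the steps above can also be done from Python. The following is only a minimal sketch using the standard library: find_java and java_version_text are hypothetical helper names, not part of tabula-py, and search_path is an illustrative parameter added so the lookup can be pointed at a specific folder.

```python
import shutil
import subprocess


def find_java(search_path=None):
    """Return the full path of the java executable, or None if it is not found.

    search_path is a hypothetical convenience parameter: leave it as None to
    search the PATH environment variable, or pass a directory to look there.
    """
    return shutil.which("java", path=search_path)


def java_version_text(java_path):
    """Run `java -version` and return its output (Java prints it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        # Same situation as the "'java' is not recognized" message above.
        print("java was not found on PATH; add the Java bin folder to PATH.")
    else:
        print("Found java at:", java_path)
        print(java_version_text(java_path))
```

If the script reports that java was not found, open a new command prompt after editing PATH, since already-open windows keep the old value.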
Okay, let's start working with tabula-py:
import tabula
# Read PDF into DataFrame (pages="all" reads every page)
df = tabula.read_pdf("test.pdf", pages="all")
# Read remote PDF into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# Convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
# Convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format="csv")
# Convert only specific pages if the PDF has many pages
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages=(1, 2, 3))
# After converting to CSV, you can load the extracted CSV file as a pandas DataFrame.
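Loading the converted CSV back into pandas might look like the sketch below. It assumes pandas is installed. load_extracted_csv is a hypothetical helper name. The sample rows are made-up stand-in data written to a temporary file so the example runs on its own; in practice you would pass the path of the output.csv produced by convert_into().

```python
import os
import tempfile

import pandas as pd


def load_extracted_csv(csv_path):
    """Load a CSV file (such as one written by tabula.convert_into) into a DataFrame."""
    return pd.read_csv(csv_path)


# Stand-in for the "output.csv" that tabula.convert_into() would write.
# The rows are made-up sample data, not real extraction results.
sample_dir = tempfile.mkdtemp()
sample_csv = os.path.join(sample_dir, "output.csv")
with open(sample_csv, "w", newline="") as handle:
    handle.write("name,quantity\nwidget,3\ngadget,5\n")

df = load_extracted_csv(sample_csv)
print(df.head())               # first rows of the table
print(df["quantity"].sum())    # ordinary pandas operations work from here
```

Tables extracted from PDFs often need cleanup (stray header rows, merged cells), so it is worth inspecting df.head() before relying on column types.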