This project uses Ollama to run a generative AI model locally for extracting metadata from files.
The selected model is Llama 3.1 (8b-instruct-q8_0), an instruction-tuned model with 8 billion parameters and 8-bit quantization, chosen for its balance between accuracy and efficiency.
- Extract Basic Metadata using Apache Tika.
- Determine File Type:
- PDF Files: Extract detailed metadata using PyPDF2.
- Spreadsheet Files: Extract metadata using Pandas.
- Process PDFs:
- Divide content into batches.
- Send requests to Ollama API with a prompt for table extraction in strict JSON format.
- Process Spreadsheets:
- Convert them into a Pandas DataFrame.
- Send the DataFrame to Ollama API for table extraction.
- Clean Responses:
- Utilize JSONaut API to clean the JSON output.
- Output Metadata:
- Consolidate metadata into a structured JSON object.
- Store results in a file: metadata_results.json.
- Output PDF Report:
- Read metadata_results.json to create a readable PDF report of all the metadata.
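The file-type check and the batching step in the workflow above can be sketched as follows. This is only an illustration under stated assumptions: the extension sets, the 4000-character batch size, the prompt wording, and the function names are all hypothetical, and the project itself may rely on the content type reported by Apache Tika instead of file extensions.

```python
from pathlib import Path

# Hypothetical extension sets; the project may instead use the
# content type reported by Apache Tika.
PDF_EXTENSIONS = {".pdf"}
SPREADSHEET_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def classify_file(path):
    """Return 'pdf', 'spreadsheet', or 'other' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in SPREADSHEET_EXTENSIONS:
        return "spreadsheet"
    return "other"


def split_into_batches(text, batch_size=4000):
    """Split text into consecutive chunks of at most batch_size characters."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]


def build_table_prompt(batch):
    """Wrap one batch in an instruction asking for strict JSON output."""
    return (
        "Extract every table from the text below. "
        "Respond with strict JSON only, no commentary.\n\n" + batch
    )
```

Splitting by character count is the simplest option; splitting by page may preserve table boundaries better for PDFs.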
- Llama 3.1 extracts metadata, including:
- Table name
- Column headers
- Data types
- Table descriptions
- JSONaut API cleans the JSON output.
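Because the model's output is free text, it can help to check that each parsed table actually carries the four fields listed above. The sketch below assumes hypothetical key names (`table_name`, `column_headers`, `data_types`, `description`); the real keys depend on the prompt used in the project.

```python
import json

# Hypothetical key names for the four fields listed above.
REQUIRED_KEYS = ("table_name", "column_headers", "data_types", "description")


def missing_keys(table):
    """Return the required keys that are absent from a table dict."""
    return [key for key in REQUIRED_KEYS if key not in table]


def parse_table_metadata(raw):
    """Parse a JSON string into a dict and report which required keys are missing."""
    table = json.loads(raw)
    if not isinstance(table, dict):
        raise ValueError("expected a JSON object describing one table")
    return table, missing_keys(table)
```

A non-empty list of missing keys is a signal to retry the request or to flag the file for manual review.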
Download and install Ollama on your machine. Then, open a terminal and run:
ollama run llama3.1:8b-instruct-q8_0
This will download the Llama 3.1 model and set up an API to send requests.
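Once the model is running, requests can be sent to the local Ollama HTTP API. The sketch below is a minimal example using only the standard library. It assumes Ollama is listening on its default address (http://localhost:11434) and uses the /api/generate endpoint with streaming disabled. The function names and the timeout are illustrative and not taken from the project code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL = "llama3.1:8b-instruct-q8_0"


def build_payload(prompt, model=MODEL):
    """Build the request body: one non-streamed completion, constrained to JSON."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def ask_ollama(prompt, url=OLLAMA_URL, timeout=300):
    """POST the prompt to Ollama and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, the full completion is in the "response" field.
    return body["response"]
```

Setting "format" to "json" asks Ollama to constrain the output to valid JSON, which may reduce the amount of cleanup needed afterwards.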
Save the project code in a folder and install required Python libraries:
pip install
- Create a free account on JSONaut.
- Get your API key (allows up to 8000 characters per request).
- Replace YOUR_API_KEY on line 73 of the code with your actual API key.
- Create a folder named files in the project directory.
- Place all files to be processed inside files.
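As a rough illustration of how a script might pick up the inputs, the sketch below lists the regular files inside the files folder. The function name and the choice to fail early when the folder is missing are assumptions, not a description of the project code.

```python
from pathlib import Path


def list_input_files(folder="files"):
    """Return the regular files directly inside the input folder, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"input folder not found: {root}")
    return sorted(path for path in root.iterdir() if path.is_file())
```

Subfolders are skipped here, so only files placed directly in the folder are returned.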
Execute the script to generate metadata:
python metadata_extractor.py
This will create a JSON file: metadata_results.json.
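To take a quick look at the output before generating reports, the file can be loaded with the standard json module. The exact layout of metadata_results.json is not documented here, so the sketch below makes no assumption about it beyond it being valid JSON. The helper names are illustrative.

```python
import json
from pathlib import Path


def load_results(path="metadata_results.json"):
    """Load the JSON file written by the extraction script."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(results):
    """Return a one-line description of the top-level JSON structure."""
    if isinstance(results, dict):
        return f"{len(results)} top-level keys: {', '.join(sorted(results))}"
    if isinstance(results, list):
        return f"{len(results)} entries"
    return f"single value of type {type(results).__name__}"
```

For example, `print(summarize(load_results()))` run from the project directory prints a short overview of what was extracted.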
To create PDF reports of the extracted metadata:
python pdf_generator.py
This will generate PDF reports for each file inside the files folder.
- metadata_results.json (contains extracted metadata in JSON format).
- PDF reports summarizing metadata.
- Ensure Ollama and JSONaut API are correctly set up before running the scripts.
- The project requires an internet connection to download the AI model initially and to send requests to JSONaut.