Metadata Extractor

Overview

This project uses Ollama to run a generative AI model locally and extract metadata from files.

The selected model is Llama 3.1 (8b-instruct-q8_0), an instruction-tuned model with 8 billion parameters and 8-bit quantization (q8_0), chosen for its balance between accuracy and efficiency.

Approach

1. Extract basic metadata using Apache Tika.
2. Determine the file type:
- PDF files: extract detailed metadata using PyPDF2.
- Spreadsheet files: extract metadata using Pandas.
3. Process PDFs:
- Divide the content into batches.
- Send each batch to the Ollama API with a prompt that asks for table extraction in strict JSON format.
4. Process spreadsheets:
- Convert each spreadsheet into a Pandas DataFrame.
- Send the DataFrame to the Ollama API for table extraction.
5. Clean responses:
- Use the JSONaut API to clean the JSON output.
6. Output metadata:
- Consolidate the metadata into a structured JSON object.
- Store the results in a file: metadata_results.json.
7. Output PDF report:
- Read metadata_results.json to create a readable PDF report of all the metadata.
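
The batching and request steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the batch size, the prompt wording, the JSON key names in the prompt, and the helper names are all assumptions. The endpoint is Ollama's default local address, and the request fields (model, prompt, stream, format) are those of its /api/generate endpoint.

```python
import json
import urllib.request

# Assumptions: default local Ollama endpoint and an illustrative prompt.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q8_0"
PROMPT = (
    "Extract every table from the text below. Respond with strict JSON only, "
    "using the keys table_name, column_headers, data_types and description.\n\n"
)


def split_into_batches(text, max_chars=4000):
    """Split text into consecutive chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_payload(batch):
    """Build the JSON body for one non-streaming /api/generate request."""
    return {
        "model": MODEL,
        "prompt": PROMPT + batch,
        "stream": False,   # return one complete response object
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }


def send_batch(batch, timeout=300):
    """POST one batch to the local Ollama server and return the model's text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(batch)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the Ollama server running, calling send_batch on each element of split_into_batches(pdf_text) would return one JSON string per batch. Splitting on a fixed character count can cut a table in half, so a real implementation might prefer page boundaries.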

Use of Generative AI

Llama 3.1 extracts metadata, including:
- Table name
- Column headers
- Data types
- Table descriptions
JSONaut API cleans the JSON output.
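
To illustrate why a cleaning step is needed, the sketch below pulls the first JSON object out of a model reply that wraps it in extra text, then reports which metadata fields are missing. It is a local stand-in written with the standard library, not the JSONaut API, and the key names are hypothetical; the project's actual schema may differ.

```python
import json

# Hypothetical key names; the project's actual schema may differ.
EXPECTED_KEYS = ("table_name", "column_headers", "data_types", "description")


def extract_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


def missing_keys(table):
    """List the expected keys that are absent from one table record."""
    return [key for key in EXPECTED_KEYS if key not in table]


# A made-up model reply with chatter around the JSON payload.
raw = (
    "Sure! Here is the table:\n"
    '{"table_name": "Sales", "column_headers": ["region", "total"]}\n'
    "Hope this helps."
)
table = extract_json_object(raw)
print(table["table_name"])
print(missing_keys(table))
```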

Setup Instructions

1. Install Ollama

Download and install Ollama on your machine. Then, open a terminal and run:

ollama run llama3.1:8b-instruct-q8_0

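This downloads the Llama 3.1 model on first use and opens an interactive session. The local Ollama server, which listens on http://localhost:11434 by default, then accepts API requests for the model.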
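
Before running the extractor, it can help to confirm that the model is actually available. The sketch below queries Ollama's /api/tags endpoint, which lists locally downloaded models, and checks for the model name. The default port and the helper names are assumptions.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumes the default port
MODEL = "llama3.1:8b-instruct-q8_0"


def model_is_listed(tags, model=MODEL):
    """Return True if model appears in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags.get("models", []))


def fetch_tags(timeout=5):
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling model_is_listed(fetch_tags()) returns False if the model has not been pulled yet. If the server is not running, fetch_tags raises urllib.error.URLError instead.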

2. Install Dependencies

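Save the project code in a folder and install the required Python libraries. The command below covers the libraries named in the Approach section; the scripts may import further packages, so check their import statements if an import fails:

pip install tika PyPDF2 pandas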
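
A quick way to confirm the installation is to check that the modules can be found. The sketch below assumes the import names tika, PyPDF2 and pandas for the libraries named in the Approach section; the scripts may need others that are not listed here.

```python
import importlib.util

# Import names for the libraries mentioned in the Approach section.
REQUIRED_MODULES = ["tika", "PyPDF2", "pandas"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules() or "none")
```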

3. Set Up JSONaut API

Create a free account on JSONaut.
Get your API key (allows up to 8000 characters per request).
Replace YOUR_API_KEY on line 73 of the code with your actual API key.
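
As an alternative to editing the key into the source file, the key could be read from an environment variable so it never ends up in version control. This is a sketch of that idea, not how the project currently works: the variable name JSONAUT_API_KEY and both helper functions are hypothetical. The 8000-character figure is the per-request limit mentioned above.

```python
import os

ENV_VAR = "JSONAUT_API_KEY"   # hypothetical variable name
PLACEHOLDER = "YOUR_API_KEY"  # placeholder string used in the script
MAX_REQUEST_CHARS = 8000      # per-request limit stated in this README


def load_api_key(environ=None):
    """Return the API key from the environment or raise a clear error."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set the {ENV_VAR} environment variable to your API key")
    return key


def within_request_limit(text):
    """Return True if text fits in a single request under the stated limit."""
    return len(text) <= MAX_REQUEST_CHARS
```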

4. Prepare Your Files

Create a folder named files in the project directory.
Place all files to be processed inside files.
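
To preview which files will be treated as PDFs or spreadsheets, a small helper like the one below can list the folder contents by extension. The extension mapping and function names are assumptions for illustration; the extractor may recognize a different set of formats.

```python
from pathlib import Path

# Assumed extension mapping; extend it to match the formats you process.
SPREADSHEET_EXTENSIONS = {".xlsx", ".xls", ".csv"}


def classify(path):
    """Map a file path to 'pdf', 'spreadsheet' or 'other' by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in SPREADSHEET_EXTENSIONS:
        return "spreadsheet"
    return "other"


def list_input_files(folder="files"):
    """Return (name, kind) pairs for the regular files in folder, sorted by name."""
    return [(p.name, classify(p)) for p in sorted(Path(folder).iterdir()) if p.is_file()]
```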

5. Run the Metadata Extractor

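Execute the script to generate metadata. The repository's file listing names the script extractor.py, so use that name if metadata_extractor.py is not present: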

python metadata_extractor.py

This will create a JSON file: metadata_results.json.
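
To inspect the result without opening the file by hand, it can be loaded with the standard library. The layout of metadata_results.json is not documented here, so the sketch below deliberately avoids assuming a schema and only describes the top level; the function names are illustrative.

```python
import json
from pathlib import Path


def load_results(path="metadata_results.json"):
    """Load the extractor's output file and return the parsed JSON value."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def top_level_summary(results):
    """Describe the top level of the results without assuming a schema."""
    if isinstance(results, dict):
        return sorted(results)
    if isinstance(results, list):
        return [f"item {index}" for index in range(len(results))]
    return [type(results).__name__]
```

For example, print(top_level_summary(load_results())) shows the top-level keys if the file holds a JSON object, or one placeholder entry per element if it holds a list.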

6. Generate PDF Reports

To create PDF reports of the extracted metadata:

python pdf_generator.py

This will generate PDF reports for each file inside the files folder.

Output

metadata_results.json (contains extracted metadata in JSON format).
PDF reports summarizing metadata.

Notes

Ensure Ollama and JSONaut API are correctly set up before running the scripts.
The project requires an internet connection to download the AI model initially and to send requests to JSONaut.
