A Python CLI tool for extracting and exporting metadata from Dataverse repositories. It supports bulk extraction of metadata for dataverses, datasets, and data files from any level of a Dataverse collection (an entire repository or a sub-Dataverse), with export to JSON and CSV formats.
git clone https://github.com/scholarsportal/dataverse-metadata-crawler.git
cd ./dataverse-metadata-crawler
touch .env # For Unix/MacOS
nano .env # or vim .env, or your preferred editor
# OR
New-Item .env -Type File # For Windows (PowerShell)
notepad .env
# .env file
BASE_URL = "TARGET_REPO_URL" # e.g., "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY" # Find in your Dataverse account settings. You may also pass it on the command line (with the -a flag)
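To illustrate the file format, a `.env` file in this shape can be read with a few lines of standard-library Python. This is a minimal sketch and not the crawler's own loader; `parse_env_text` is a hypothetical helper, and it assumes only the simple `KEY = "value" # comment` form with double-quoted values.

```python
import re

# Matches lines like: KEY = "value"  # optional trailing comment
_ENV_LINE = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"([^"]*)"')


def parse_env_text(text):
    """Parse simple KEY = "value" lines; blank lines and comment lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = _ENV_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


# Sample content with placeholder values (not a real token).
sample = '''
# .env file
BASE_URL = "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY" # placeholder
'''
config = parse_env_text(sample)
print(config["BASE_URL"])
```

Unquoted or single-quoted values are ignored by this sketch, so a dedicated `.env` library is the safer choice for anything beyond a quick check.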
python3 -m venv .venv
source .venv/bin/activate # For Unix/MacOS
# OR
.venv\Scripts\activate # For Windows
pip install -r requirements.txt
python3 dvmeta/main.py [-a AUTH] [-l] [-d] [-p] [-f] [-e] [-s] -c COLLECTION_ALIAS -v VERSION
Required arguments:
Option | Short | Type | Description | Default |
---|---|---|---|---|
--collection_alias | -c | TEXT | Name of the collection to crawl. [required] | None |
--version | -v | TEXT | The Dataset version to crawl. Options include: • draft - The draft version, if any • latest - Either a draft (if exists) or the latest published version • latest-published - The latest published version • x.y - A specific version [required] | None (required) |
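The accepted `-v` values can be checked before launching a long crawl. The sketch below is illustrative only: `is_valid_version` is a hypothetical helper, and it assumes that "x.y" means two non-negative integers separated by a dot (as in `1.0`). The tool's own validation may differ.

```python
import re

# Keywords documented for -v/--version.
VERSION_KEYWORDS = {"draft", "latest", "latest-published"}

# Assumption: a specific version is written as <major>.<minor>, e.g. "1.0" or "2.13".
_SPECIFIC_VERSION = re.compile(r"\d+\.\d+")


def is_valid_version(value):
    """Return True if value is a documented keyword or looks like x.y."""
    return value in VERSION_KEYWORDS or _SPECIFIC_VERSION.fullmatch(value) is not None


for candidate in ("latest", "1.0", "v1"):
    print(candidate, is_valid_version(candidate))
```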
Optional arguments:
Option | Short | Type | Description | Default |
---|---|---|---|---|
--auth | -a | TEXT | Authentication token to access the Dataverse repository. If | None |
--log, --no-log | -l | | Output a log file. Use --no-log to disable logging. | log (unless --no-log) |
--dvdfds_metadata | -d | | Output a JSON file containing metadata of Dataverses, Datasets, and Data Files. | |
--permission | -p | | Output a JSON file that stores permission metadata for all Datasets in the repository. | |
--emptydv | -e | | Output a JSON file that stores all Dataverses which do not contain Datasets (though they might have child Dataverses which have Datasets). | |
--failed | -f | | Output a JSON file of Dataverses/Datasets that failed to be crawled. | |
--spreadsheet | -s | | Output a CSV file of the metadata of Datasets. | |
--help | | | Show the help message. | |
# Export the metadata of the latest version of all datasets under collection 'demo' to JSON
python3 dvmeta/main.py -c demo -v latest -d
# Export the metadata of version 1.0 of all datasets under collection 'demo' to JSON and CSV
python3 dvmeta/main.py -c demo -v 1.0 -d -s
# Export the metadata and permission metadata of version 1.0 of all datasets under collection 'demo' to JSON and CSV, with the API token passed on the command line
python3 dvmeta/main.py -c demo -v 1.0 -d -s -p -a xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
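When several collections need to be crawled in one go, the same invocations can be assembled from a script. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper (not part of the tool), it uses only the flags documented above, and it assumes the script runs from the repository root so that `dvmeta/main.py` resolves.

```python
import shlex


def build_command(collection, version, *, metadata=False, spreadsheet=False,
                  permission=False, empty=False, failed=False, token=None):
    """Return the argument list for one crawler run, using the documented flags."""
    cmd = ["python3", "dvmeta/main.py", "-c", collection, "-v", version]
    flags = [("-d", metadata), ("-s", spreadsheet), ("-p", permission),
             ("-e", empty), ("-f", failed)]
    # Append only the switches that were requested, in a fixed order.
    cmd += [flag for flag, enabled in flags if enabled]
    if token is not None:
        cmd += ["-a", token]
    return cmd


cmd = build_command("demo", "1.0", metadata=True, spreadsheet=True)
# Print the command for inspection; pass cmd to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when an alias or token contains unusual characters.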
File | Description |
---|---|
ds_metadata_yyyymmdd-HHMMSS.json | Metadata of datasets and their data files in JSON format. |
empty_dv_yyyymmdd-HHMMSS.json | The IDs of empty dataverse(s) in list format. |
failed_metadata_uris_yyyymmdd-HHMMSS.json | The URIs (URLs) of datasets that failed to be downloaded. |
permission_dict_yyyymmdd-HHMMSS.json | The permission metadata of datasets, with their dataset IDs. |
pid_dict_yyyymmdd-HHMMSS.json | Datasets' basic info with hierarchical information, as a dictionary. Only exported if the -p (permission) flag is used without the -d (metadata) flag. |
pid_dict_dd_yyyymmdd-HHMMSS.json | The hierarchical information of deaccessioned/draft datasets. |
ds_metadata_yyyymmdd-HHMMSS.csv | Metadata of datasets and their data files in CSV format. |
log_yyyymmdd-HHMMSS.txt | Summary of the crawling work. |
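Because every output file carries a `yyyymmdd-HHMMSS` suffix, the file type and crawl time can be recovered from the name alone. The sketch below is one way to do that; `parse_export_name` is a hypothetical helper, and the regular expression is an assumption derived from the naming pattern in the table above.

```python
import re
from datetime import datetime

# <kind>_<yyyymmdd>-<HHMMSS>.<ext>, e.g. ds_metadata_20250131-142530.json
_EXPORT_NAME = re.compile(r"^(?P<kind>.+)_(?P<stamp>\d{8}-\d{6})\.(?P<ext>json|csv|txt)$")


def parse_export_name(filename):
    """Return (kind, timestamp, extension), or None if the name does not match."""
    match = _EXPORT_NAME.match(filename)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d-%H%M%S")
    except ValueError:
        # Digits were present but do not form a real date/time.
        return None
    return match.group("kind"), stamp, match.group("ext")


# Example file name with a made-up timestamp.
print(parse_export_name("pid_dict_dd_20250131-142530.json"))
```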
exported_files/
├── json_files/
│   ├── ds_metadata_yyyymmdd-HHMMSS.json # With -d flag enabled
│   ├── empty_dv_yyyymmdd-HHMMSS.json # With -e flag enabled
│   ├── failed_metadata_uris_yyyymmdd-HHMMSS.json
│   ├── permission_dict_yyyymmdd-HHMMSS.json # With -p flag enabled
│   ├── pid_dict_yyyymmdd-HHMMSS.json # Only exported if -p flag is used without -d flag
│   └── pid_dict_dd_yyyymmdd-HHMMSS.json # Hierarchical information of deaccessioned/draft datasets
├── csv_files/
│   └── ds_metadata_yyyymmdd-HHMMSS.csv # With -s flag enabled
└── logs_files/
    └── log_yyyymmdd-HHMMSS.txt # Exported by default, without specifying --no-log
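After repeated runs, the export folders accumulate one file per crawl. Since the timestamp suffix is fixed-width, sorting by file name also sorts by time, so the newest export can be picked up as sketched below. `latest_export` is a hypothetical helper, and the demo builds a throwaway copy of the layout in a temporary directory with made-up timestamps.

```python
import tempfile
from pathlib import Path


def latest_export(root, subdir, prefix, suffix):
    """Return the newest matching file under root/subdir, or None if there is none."""
    # Requiring a digit right after "<prefix>_" keeps "pid_dict" from matching "pid_dict_dd".
    pattern = f"{prefix}_[0-9]*-[0-9]*{suffix}"
    candidates = sorted(Path(root, subdir).glob(pattern))
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    json_dir = Path(tmp, "exported_files", "json_files")
    json_dir.mkdir(parents=True)
    for name in ("ds_metadata_20250101-090000.json",
                 "ds_metadata_20250131-142530.json",
                 "pid_dict_dd_20250201-000000.json"):
        (json_dir / name).write_text("{}")
    newest = latest_export(Path(tmp, "exported_files"), "json_files", "ds_metadata", ".json")
    print(newest.name)
```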
No tests have been written yet. Contributions welcome!
If you use this software in your work, please cite it using the following metadata.
APA:
Lui, L. H. (2025). Dataverse Metadata Crawler (Version 0.1.0) [Computer software]. https://github.com/scholarsportal/dataverse-metadata-crawler
BibTeX:
@software{Lui_Dataverse_Metadata_Crawler_2025,
author = {Lui, Lok Hei},
month = jan,
title = {Dataverse Metadata Crawler},
url = {https://github.com/scholarsportal/dataverse-metadata-crawler},
version = {0.1.0},
year = {2025}
}
Ken Lui - Data Curation Specialist, Map and Data Library, University of Toronto - kenlh.lui@utoronto.ca