# Character Archive Torrent Part 1

*The Character Archive is operated by Cyberes - chub-archive@evulid.cc - [@cyberes:evulid.cc](https://matrix.to/#/@cyberes:evulid.cc) - [char-archive.evulid.cc](https://char-archive.evulid.cc)*

- **SNAPSHOT DATE:** May 15, 2024
- **SNAPSHOT SIZE:** 87G
- **ARCHIVED SIZE:** 76G


This is the first torrent of the rebranded and reorganized character card archive from `char-archive.evulid.cc`.


The archive consists of a PostgreSQL database containing the card info and definitions, plus a folder storing the hashed card images. The images are organized into nested sub-folders, one level for each of the first three characters of the hash stored in the `image_hash` column, and the rest of the hash is the file name. For example, the image with the hash `85c5d1f03fb851c7f5b297624b035783` is located at `/8/5/c/5d1f03fb851c7f5b297624b035783`. Note: this hash is created from the image after its metadata has been stripped.
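The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not code shipped with the archive: the function name `hashed_image_path` and the `root` parameter are hypothetical, and `root` should point at wherever you extracted the hashed image folder.

```python
from pathlib import PurePosixPath


def hashed_image_path(image_hash: str, root: str = "/") -> PurePosixPath:
    """Map an `image_hash` value to its location in the hashed image folder.

    The first three characters each become one directory level and the
    remainder of the hash is the file name. `root` is an assumption: set it
    to the directory where the hashed data was extracted.
    """
    image_hash = image_hash.strip()
    if len(image_hash) <= 3:
        raise ValueError("hash is too short to split into sub-folders")
    return PurePosixPath(root, image_hash[0], image_hash[1], image_hash[2], image_hash[3:])


print(hashed_image_path("85c5d1f03fb851c7f5b297624b035783"))
# /8/5/c/5d1f03fb851c7f5b297624b035783
```

To resolve every image for a set of cards, read the `image_hash` column from the database and pass each value through a function like this one.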


Prior torrents for the original "chub archive" are no longer required, unless you want the archive as it was before its reorganization.


#### Torrent File Download

[char-archive_part-1.torrent](https://chub-archive.evulid.cc/api/file/download?path=/takeout/char-archive_part-1.torrent&download=true)


## The Archive

Character and lorebook avatars are shrunk to a maximum of 1000 px by 1000 px and the resulting PNG files are compressed.
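As a rough illustration of the size limit, the sketch below computes the dimensions an avatar would have after being shrunk to fit inside a 1000 px by 1000 px box. It assumes the aspect ratio is preserved and that smaller images are left untouched; neither detail is stated here, and the function name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, limit: int = 1000) -> tuple[int, int]:
    """Return (width, height) scaled down so neither side exceeds `limit`.

    Assumptions (not confirmed by the archive): the aspect ratio is kept
    and images already within the limit are not enlarged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= limit:
        return width, height
    # Scale both sides by limit / longest, never going below 1 px.
    new_width = max(1, round(width * limit / longest))
    new_height = max(1, round(height * limit / longest))
    return new_width, new_height


print(fit_within(2000, 1000))  # (1000, 500)
print(fit_within(800, 600))    # (800, 600)
```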

Character definitions are normalized to the spec V2 format and lorebooks are normalized to standardize their data. The original, unmodified definitions are stored alongside these reformatted versions and are available for download.


## Data

### chub.ai

Archive and mirror of [chub.ai](https://characterhub.org). 

Characters and lorebooks are scraped from chub.ai every hour or so. A complete scrape is performed at least once a month. 

### booru.plus

Character cards scraped from (the now-defunct) `booru.plus/+pygmalion`. Cards are sorted by author. Original comments are also displayed. 

The final scrape of booru.plus was on March 4, 2024, and the site went down sometime in late March or early April. Consequently, this is not a complete archive and an unknown amount of data is missing.

### Generic Character Cards

The scrapers crawl the web searching for character cards. One primary source is files hosted on `catbox.moe`.

If a chub.ai author is found in a card's metadata, it is added to the chub.ai archive. 

**Imported Data Sources**

- Janitor AI: cards scraped from Janitor AI before they made card definitions private.
- Pygmalion Discord Server: cards from the Pygmalion Discord server up to 04-18-2023. This data has not been imported into the archive.
- Roko's Basilisk: scrape of Roko's Basilisk, an early but influential frontend for chatbots which shut down after a week over concerns regarding OpenAI's terms of service. Contains the defs of many CAI bots that remain private on character.ai. Predates chub.ai and SillyTavern. Authors, where found, have been imported.
- VenusAI: VenusAI up to 05-27-2023, scraped by Koreans. 
- VenusAI Official Discord Server: cards from the official VenusAI Discord. This archive was created on 05-28-2023 and originally distributed as ai_characters_archive.zip.

### Historical

Original archives from third parties. These are stored for historical reasons and most have been integrated into the generic character archive.

### Logs

Logs from various proxies.

### Other

Miscellaneous data relating to chatbots and the Character Archive. Also contains info on the Character Archive torrents. 


## Contents

`char_archive.sql.7z`: the PostgreSQL database dump.

`hashed-data.7z`: the folder of hashed card images.

`proxy_stats.json.7z`: dump of the proxy stats Elasticsearch index using [elasticsearch-dump](https://github.com/elasticsearch-dump/elasticsearch-dump).

Organization of `files/`:

```
files/
├── historical
│   ├── [2.9G]  chub_ai-mega-nz_scrape.7z
│   ├── [511M]  Pygmalion_Discord_Server_04-18-2023.7z
│   ├── [ 78M]  Rokos_Basilisk_Archive.7z
│   ├── [1.2G]  venusai-chat_05-27-2023.7z
│   └── [1.6G]  Venusai-Official-Discord-Server.7z
├── logs
│   ├── [ 10M]  cute logs - Oct 14 2023.7z
│   └── vgdasfgadg c
│       ├── [ 14M]  1.7z
│       ├── [424M]  1 Dataset.7z
│       ├── [ 12M]  2.7z
│       ├── [ 12M]  3.7z
│       ├── [ 11M]  4.7z
│       ├── [ 12M]  5.7z
│       ├── [ 13M]  6.7z
│       ├── [4.8M]  7.7z
│       ├── [5.4M]  prompt-logs1.jsonl.7z
│       ├── [5.3M]  prompt-logs2.jsonl.7z
│       └── [1.2K]  README.md
├── other
│   ├── [8.0M]  404media - DIY Chatbots Unleash Large Language Models' Repressed Sexuality.pdf
│   ├── [ 824]  crack-prompt.txt
│   ├── [954K]  LLMjacking Stolen Cloud Credentials Used in New AI Attack.pdf
│   └── [131K]  Researchers Uncover 'LLMjacking' Scheme Targeting Cloud-Hosted AI Models.pdf
├── takeout
│   ├── [5.1K]  chub-archive_part-1.md
│   ├── [1.3M]  chub-archive_part-1.torrent
│   ├── [1.0K]  chub-archive_part-2.md
│   └── [1.5M]  chub-archive_part-2.torrent
```


## Mission Statement

Chatbots powered by artificial intelligence have been around for decades, but only recently have they become capable of engaging in human-like interactivity. Following the release of OpenAI's GPT-3.5 in March of 2022, creative individuals discovered that the AI could take on "personalities" and role-play a character. A community formed around chatting with these "bots" and sharing the "character cards" that defined a personality. Concerned about the capabilities of the AI and the creativity of the users, the corporations that owned the AI models took steps to restrict this activity, claiming it was "out of scope" and "unsafe". 

The Character Archive was created to protect this creativity.