Download Data

Download the complete Mahua Word Embeddings dataset. Files are organized by type and year. Data files are provided in JSON, CSV, and TXT formats; the similarity networks also include interactive HTML visualizations.

Complete Dataset

All Files (ZIP)

Download everything in a single archive:

Description | Size | Format
Complete Dataset (all years, all models) | ~20 MB | .zip
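
Once the archive has been downloaded, it can be unpacked and inventoried with the Python standard library alone. The following is a minimal sketch: the archive name `mahua_dataset.zip` and the output directory `mahua_dataset` are placeholders chosen for illustration, not names confirmed by this page, and the internal layout of the archive is not assumed.

```python
import zipfile
from collections import Counter
from pathlib import Path


def extract_and_summarize(archive_path, out_dir):
    """Extract a ZIP archive and count the extracted files by extension."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(out_dir)
        # Skip directory entries so that only real files are counted.
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    return Counter(Path(name).suffix.lower() for name in names)


if __name__ == "__main__":
    # Placeholder paths: substitute the location of the downloaded archive.
    archive = Path("mahua_dataset.zip")
    if archive.exists():
        summary = extract_and_summarize(archive, "mahua_dataset")
        for suffix, count in sorted(summary.items()):
            print(f"{suffix or '(no extension)'}: {count}")
```

Counting by extension gives a quick check that the JSON, CSV, and TXT files described on this page are all present after extraction.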

Model Data by Year

Word embeddings (Word2Vec, FastText, BERT) for each year of the corpus.

1955 (4 files)
Word2Vec Model
1955_model_data_word2vec.json • ~200 KB
Download
FastText Model
1955_model_data_fasttext.json • ~200 KB
Download
BERT Model
1955_model_data_bert.json • ~400 KB
Download
Combined (All Models)
1955_model_data_word2vec_fasttext_bert.json • ~800 KB
Download
1956-1960 (5 years)

Similar structure to 1955. Each year contains:

  • *_model_data_word2vec.json
  • *_model_data_fasttext.json
  • *_model_data_bert.json
  • *_model_data_word2vec_fasttext_bert.json
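
The naming pattern above can be turned into a small helper that lists the files to expect for a given year. This is a sketch under one assumption: that the `*` prefix for 1956-1960 is the four-digit year, as it is in the 1955 and 1961 file names shown on this page. The function name `expected_files` and its `combined_only` flag are hypothetical.

```python
MODELS = ("word2vec", "fasttext", "bert")


def expected_files(year, combined_only=False):
    """Return the JSON file names this page lists for one corpus year.

    Assumes the file prefix is the four-digit year (true for 1955 and 1961).
    """
    combined = f"{year}_model_data_{'_'.join(MODELS)}.json"
    if combined_only:
        # 1961 is listed with the combined file only.
        return [combined]
    per_model = [f"{year}_model_data_{model}.json" for model in MODELS]
    return per_model + [combined]
```

For example, `expected_files(1955)` yields the four 1955 file names listed above, and `expected_files(1961, combined_only=True)` yields the single combined file listed for 1961.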
1961 (1 file)
All Models Combined
1961_model_data_word2vec_fasttext_bert.json • ~200 KB
Download

Rationality Analysis Data

Similarity networks for 7 rationality-related concepts, built from the 1959_04 jf78 subcorpus.

Model Data (jf78 subcorpus)
Word2Vec (jf78)
1959_04_1_jf78_model_data_word2vec.json • ~100 KB
Download
FastText (jf78)
1959_04_1_jf78_model_data_fasttext.json • ~100 KB
Download
BERT (jf78)
1959_04_1_jf78_model_data_bert.json • ~300 KB
Download
Similarity Networks (7 concepts × 6 methods)

Each concept has one network per method, for 6 similarity or distance methods: cosine, euclidean, manhattan, jaccard, pearson, and spearman.

Formats: CSV (tabular), JSON (network plot), HTML (interactive visualization)

Embedding Visualizations

Dimensionality reduction plots (2D/3D) using PCA, t-SNE, and UMAP.
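
As a rough illustration of what a 2D PCA projection does, the sketch below centers a set of vectors and projects them onto their two leading principal components using power iteration. It is a dependency-free teaching example, not the pipeline used to produce the plots on this page; the function name `pca_2d` and its parameters are hypothetical. t-SNE and UMAP need dedicated libraries and are not shown.

```python
import math
import random


def pca_2d(vectors, iterations=200, seed=0):
    """Project row vectors onto their first two principal components."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(row[j] for row in vectors) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in vectors]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iterations):
            # Apply the (unnormalized) covariance matrix as X^T (X w).
            scores = [sum(a * b for a, b in zip(row, w)) for row in centered]
            w_new = [sum(scores[i] * centered[i][j] for i in range(n))
                     for j in range(dim)]
            # Remove directions already found so the next one is orthogonal.
            for c in components:
                proj = sum(a * b for a, b in zip(w_new, c))
                w_new = [a - proj * b for a, b in zip(w_new, c)]
            norm = math.sqrt(sum(a * a for a in w_new))
            if norm < 1e-12:
                # No variance left in any remaining direction.
                w = [0.0] * dim
                break
            w = [a / norm for a in w_new]
        components.append(w)
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]
```

The sign of each component is arbitrary, so two runs with different seeds may produce mirrored plots that are otherwise equivalent.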

2D Visualizations (1959_04 jf78)
PCA 2D
Multi-model visualization • ~500 KB
Browse
t-SNE 2D
Multi-model visualization • ~500 KB
Browse
UMAP 2D
Multi-model visualization • ~500 KB
Browse
3D Visualizations
PCA 3D
Multi-model visualization
Browse
t-SNE 3D
Multi-model visualization
Browse
UMAP 3D
Multi-model visualization
Browse

Corpus Files

Original text files organized by year.

All Text Files
59 .txt files • ~5 MB total
Browse Corpus

GitHub Repository

The complete dataset is also available on GitHub, where you can clone the repository or download specific files.

File Format Guide