About This Dataset

Project Overview

The Mahua Word Embeddings dataset is a Digital Humanities resource containing word embeddings trained on Traditional Chinese texts from Mahua (馬華) literature — literary works published in Malaya from 1955 to 1961.

The corpus includes issues from the journal "蕉風" (Jiāo Fēng), representing a significant body of Chinese diaspora literature during the formative years of the Malayan Chinese community.

This dataset enables researchers to study the semantic evolution of Chinese diaspora literature, analyze the vocabulary of the Malayan Chinese community during the mid-20th century, and explore how concepts related to humanism and rationality were expressed in this unique literary tradition.

Project Aim

The project seeks to move beyond traditional frameworks—Malayan (national assimilation) and Chinese (diasporic or linguistic)—in analyzing the Mahua literary journal. Instead, it explores conceptual and thematic connections through digital humanities methods, engaging with broader literary and cultural debates.

A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project leverages BERT and other word embedding models to better capture contextual meanings and semantic nuances across the literary corpus, enabling new insights into the evolution of ideas and themes over time.

Dataset Structure

This dataset is organized into three main components:

corpus/

Contains the original text files organized by year (1955-1961). Each file holds one issue: a semi-monthly issue (first or second half of the month) through October 1958, or a single monthly issue from November 1958 onward. Filenames follow the YYYY-MM-first|second-issue-XXX.txt convention.
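As a rough illustration of the naming convention, the sketch below parses a filename into its parts. It assumes a literal reading of YYYY-MM-first|second-issue-XXX.txt, with XXX taken to be a numeric issue number; the regular expression, the IssueInfo type, and the sample filename are hypothetical and may need adjusting to the actual files.

```python
import re
from typing import NamedTuple, Optional

# Assumed literal reading of the YYYY-MM-first|second-issue-XXX.txt convention.
# The real files may differ (for example in separators or issue-number width).
ISSUE_NAME = re.compile(r"^(\d{4})-(\d{2})-(first|second)-issue-(\d+)\.txt$")


class IssueInfo(NamedTuple):
    year: int
    month: int
    half: str       # "first" or "second"
    issue_no: int


def parse_issue_filename(name: str) -> Optional[IssueInfo]:
    """Return the parsed fields, or None if the name does not match."""
    m = ISSUE_NAME.match(name)
    if m is None:
        return None
    year, month, half, issue_no = m.groups()
    return IssueInfo(int(year), int(month), half, int(issue_no))


# Hypothetical filename, for illustration only.
print(parse_issue_filename("1956-03-first-issue-009.txt"))
```

Returning None for non-matching names makes it easy to skip stray files when walking the corpus/ folder.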

rationality-related/

Contains specialized folders for analysis of rationality-related content from specific issues (1959_04_1_jf78 and 1959_05_1_jf79), retained for historical traceability.

yearly-based-model-data/

Contains processed model data organized by year for embedding and analysis purposes.

Corpus Coverage

Year	Issues	Character Count	Notes
1955	4	93,773	Nov–Dec only
1956	24	647,956	Complete coverage
1957	24	669,896	Complete coverage
1958	21	617,044	Complete coverage; monthly publication from November 1958 onward
1959	12	431,252	Single monthly issues
1960	12	386,075	Single monthly issues
1961	1	32,315	Jan only

Total: 98 issues • 2,878,311 characters
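The totals can be re-derived from the per-year rows. The short check below copies the figures from the table above; the variable names are illustrative.

```python
# Per-year figures copied from the Corpus Coverage table: (issues, characters).
coverage = {
    1955: (4, 93_773),
    1956: (24, 647_956),
    1957: (24, 669_896),
    1958: (21, 617_044),
    1959: (12, 431_252),
    1960: (12, 386_075),
    1961: (1, 32_315),
}

total_issues = sum(issues for issues, _ in coverage.values())
total_chars = sum(chars for _, chars in coverage.values())

print(total_issues, total_chars)  # 98 2878311
```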

Methodology

Corpus Preprocessing

Text Collection: Original texts were digitized from journal issues of 蕉風 (Jiāo Fēng)
Encoding: Preserved Traditional Chinese characters without simplification
Tokenization: Used jieba Chinese word segmentation
Cleaning: Removed English text, numbers, and special characters
Stop Words: Removed common Chinese stop words
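The tokenization, cleaning, and stop-word steps listed above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the segmenter is passed in as a function so the example runs without extra packages, the character-level stand-in is a placeholder for a real segmenter such as jieba.lcut, and the CJK range and two-word stop list are assumptions.

```python
import re
from typing import Callable, Iterable, List, Set

# A token is kept only if it consists entirely of CJK Unified Ideographs
# (basic block U+4E00 to U+9FFF). This range is an assumption; it drops
# English text, digits, punctuation, and whitespace.
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def preprocess(text: str,
               segment: Callable[[str], Iterable[str]],
               stop_words: Set[str]) -> List[str]:
    """Segment the text, drop non-Chinese tokens, then drop stop words."""
    tokens = list(segment(text))
    cleaned = [t for t in tokens if CJK_ONLY.fullmatch(t)]
    return [t for t in cleaned if t not in stop_words]


def char_segment(text: str) -> List[str]:
    """Stand-in segmenter: one token per character.
    With jieba installed, jieba.lcut could be passed to preprocess instead."""
    return list(text)


stop_words = {"的", "了"}  # tiny illustrative stop list, not the project's list
print(preprocess("蕉風的讀者 in 1956！", char_segment, stop_words))
# ['蕉', '風', '讀', '者']
```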

Sample Statistics (1956)

Sentences: 1,708
Total Tokens: 36,845
Unique Vocabulary: 39,403 tokens

Model Training

Word2Vec

Algorithm: CBOW (Continuous Bag of Words)
Vector Size: 100 dimensions
Window Size: 5
Min Count: 1
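To make the window setting concrete, the sketch below builds the (context, target) pairs that CBOW learns from: each target word is predicted from up to `window` tokens on either side. It is a simplified illustration with made-up tokens; real trainers add details such as random window shrinking and negative sampling that are omitted here.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 5) -> List[Tuple[List[str], str]]:
    """Return (context, target) pairs using the full symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip single-token sentences, which have no context
            pairs.append((context, target))
    return pairs


# Illustrative tokens only; window=2 keeps the output short.
sentence = ["文學", "理性", "人性", "自由", "思想"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

If gensim were the training library (the source does not say), the listed settings would roughly correspond to `Word2Vec(vector_size=100, window=5, min_count=1, sg=0)`.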

FastText

Algorithm: Skip-gram with subword information
Vector Size: 100 dimensions
Min n-gram: 3
Max n-gram: 6
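The n-gram bounds determine which character subsequences contribute to a word's FastText vector. The sketch below enumerates them for one word, following the usual FastText convention of wrapping the word in < and > boundary markers; it is an illustration of the idea rather than the library's internal code.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("理性"))  # ['<理性', '理性>', '<理性>']
```

With a minimum length of 3, a two-character word yields only three n-grams, all of which include a boundary marker, so subword sharing between short Chinese words is limited under these settings.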

BERT

Model: hfl/chinese-bert-wwm (Chinese BERT with Whole Word Masking)
Embedding Size: 768 dimensions
Layer: Pooled output from final layer

Dimensionality Reduction

Word embeddings were reduced to 2D and 3D for visualization using three methods:

Method	Description
PCA	Principal Component Analysis - linear projection maximizing variance
t-SNE	t-distributed Stochastic Neighbor Embedding - nonlinear, preserves local structure
UMAP	Uniform Manifold Approximation and Projection - nonlinear, balances local/global structure

Similarity Computation

Six similarity metrics were computed for the rationality concepts analysis:

Metric	Formula	Range
Cosine Similarity	(A·B) / (‖A‖ × ‖B‖)	[-1, 1]
Euclidean Distance	√(Σ(Aᵢ - Bᵢ)²)	[0, ∞)
Manhattan Distance	Σ|Aᵢ - Bᵢ|	[0, ∞)
Jaccard Index	|A ∩ B| / |A ∪ B|	[0, 1]
Pearson Correlation	Cov(A,B) / (σA × σB)	[-1, 1]
Spearman Correlation	Pearson on ranks	[-1, 1]
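The six formulas in the table can be written out in plain Python as below. This is a minimal sketch under stated assumptions: vectors are equal-length lists of floats that are neither all-zero nor constant, and the Jaccard index is applied to sets (for example sets of neighbouring tokens), since the source does not say how it was applied to embeddings. The function names and sample values are illustrative.

```python
import math
from typing import List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard(s: Set[str], t: Set[str]) -> float:
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (dev_a * dev_b)


def ranks(v: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    return pearson(ranks(a), ranks(b))


a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 7.0]
print(cosine(a, b), euclidean(a, b), manhattan(a, b))
print(pearson(a, b), spearman(a, b))
print(jaccard({"理性", "科學", "自由"}, {"理性", "自由", "人性"}))
```

Note that cosine, Pearson, and Spearman are similarities (higher means closer), while Euclidean and Manhattan are distances (lower means closer), so the two groups should not be ranked on the same scale.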

Data Formats

TXT Files (Corpus)

Original literary journal texts in the corpus/ folder.

Encoding: UTF-8
Format: Plain text, one file per issue
Language: Traditional Chinese characters

JSON Files (Model Data)

Processed data files in rationality-related/ and yearly-based-model-data/ folders.

Schema fields:

year: Publication year (1955-1961)
processed_tokens: List of segmented tokens
token_frequency: Top 50 most frequent tokens
sentence_count: Number of sentences
total_tokens: Total token count
unique_vocabulary_count: Vocabulary size
embeddings: Model-specific vector representations
model_type: "bert", "word2vec", or "fasttext"
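Because the fields may sit at the top level or inside nested objects, a quick way to inspect a file is to locate each schema field by name. The helper below is a sketch: the function name and the miniature sample record are hypothetical, and real files are much larger and may nest differently.

```python
import json
from typing import Any, Dict, Iterable, Tuple

SCHEMA_FIELDS = (
    "year", "processed_tokens", "token_frequency", "sentence_count",
    "total_tokens", "unique_vocabulary_count", "embeddings", "model_type",
)


def locate_fields(obj: Any, names: Iterable[str],
                  path: Tuple[str, ...] = ()) -> Dict[str, Tuple[str, ...]]:
    """Map each field name found in nested dicts to its first key path.
    Lists are not searched, to avoid walking large token or vector arrays."""
    wanted = set(names)
    found: Dict[str, Tuple[str, ...]] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = path + (key,)
            if key in wanted:
                found.setdefault(key, here)
            for name, sub_path in locate_fields(value, wanted, here).items():
                found.setdefault(name, sub_path)
    return found


# Hypothetical miniature record, for illustration only.
sample = json.loads(
    '{"year": 1956,'
    ' "text_processing": {"processed_tokens": ["理性"], "total_tokens": 1},'
    ' "model_type": "bert"}'
)
paths = locate_fields(sample, SCHEMA_FIELDS)
missing = [name for name in SCHEMA_FIELDS if name not in paths]
print(paths)
print("missing:", missing)
```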

How to Reuse This Dataset

Quick Start Guide

For Literary Analysis: Use the original text files in the corpus/ folder, organized chronologically from 1955 to 1961
For NLP Research: Access the processed embeddings in yearly-based-model-data/, available for all three model types (BERT, Word2Vec, FastText)
For Historical Studies: Focus on the rationality-related analysis in the rationality-related/ folders

Model Selection Guidelines

Research Goal	Recommended Model	Rationale
Semantic similarity analysis	BERT	Contextual embeddings capture nuanced meanings
Cross-temporal comparison	Word2Vec	Consistent static embeddings across years
Morphological analysis	FastText	Subword information for Chinese character analysis
Historical language patterns	Word2Vec/FastText	Faster processing for large-scale analysis

Technical Implementation Example

# Example: Loading preprocessed embeddings for 1956
import json

# The files contain Traditional Chinese text, so read them explicitly as UTF-8
with open('yearly-based-model-data/1956/1956_model_data_bert.json', 'r', encoding='utf-8') as f:
    data_1956 = json.load(f)

# Access the tokenized text and the BERT model entry
tokens = data_1956['text_processing']['processed_tokens']
embeddings = data_1956['models_data']['bert']['model_info']

Data Quality and Limitations

Known Limitations

Data Gaps:

1958 Q4 second-half issues missing from source archive
1961 coverage limited to January issue only

Quality Notes:

OCR-processed texts with manual verification
Historical orthography preserved (no modernization applied)
Consistent preprocessing pipeline across all years

Citation

If you use this dataset in your research, please cite:

Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin. 
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)." 
[Dataset]. University of Hong Kong, 2026. 
DOI: 10.5281/zenodo.18205257. 
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.

For APA style:

Wong, N. Y. H., Ye, C. T. Y., & Xiang, A. H. (2026).
DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961) 
[Dataset]. University of Hong Kong. https://doi.org/10.5281/zenodo.18205257. 
Funded by Hong Kong RGC ECS Grant No. 27609122.

Note: Each version has its own DOI. Please cite the specific version you used: v1.0.0, v1.1.0, v1.2.0, or v1.2.1.

Project Team

Project Author

Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong

Email: nyhwong@hku.hk
Website: nyhwong.com
ORCID: 0000-0003-3953-5179

Contributors

Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong

Email: u3607570@connect.hku.hk
Role: Visualization generation and dataset creation

Text Vetting

Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong

Role: Text vetting and OCR validation

Acknowledgments

Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"

Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.

License and Permissions

This dataset is provided for academic and research purposes.

Commercial use or redistribution requires permission from the project author.

Please contact Nicholas Y. H. Wong for licensing inquiries.

Contact

For questions about the dataset:

Nicholas Y. H. Wong
Email: nyhwong@hku.hk

GitHub Repository:
github.com/candyyetszyu/DH_Dataset_Mahua_word_embeddings

Please open an issue on GitHub for dataset-related questions or submit a pull request for corrections.

Version History

Version	Date	DOI	Description
v1.2.1	January 10, 2026	10.5281/zenodo.18205257	Project Website Update
v1.2.0	January 10, 2026	10.5281/zenodo.18205166	Documentation Updates
v1.1.0	November 17, 2025	10.5281/zenodo.17627000	Word2vec Network Analysis for 1959_04_1_jf78
v1.0.0	November 17, 2025	10.5281/zenodo.17626427	DH Mahua Literary Journal Dataset - Initial release with yearly model data

All versions are archived on Zenodo. Please cite the specific version used in your research.