Mahua Word Embeddings
Digital Humanities Dataset of Chinese Word Embeddings from Malayan Chinese Literature (1955-1961)
About This Dataset
This dataset contains word embeddings trained on texts from Mahua (馬華, Malayan Chinese) literature, that is, literary works published in Malaya from 1955 to 1961. The corpus includes issues from the journal "蕉風" (Jiāo Fēng), which represents a significant body of Chinese diaspora literature.
The project seeks to move beyond the two traditional frameworks for analyzing the Mahua literary journal: the Malayan (national assimilation) and the Chinese (diasporic or linguistic). Instead, it uses digital humanities methods to explore conceptual and thematic connections, engaging with broader literary and cultural debates.
The dataset provides multiple embedding models (Word2Vec, FastText, and BERT) for each year, enabling researchers to study semantic changes in the Mahua literary vocabulary over time.
Project Aim
A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project leverages BERT and other word embedding models to better capture contextual meanings and semantic nuances across the literary corpus, enabling new insights into the evolution of ideas and themes over time.
Corpus Coverage
| Year | Issues | Characters | Notes |
|---|---|---|---|
| 1955 | 4 | 93,773 | Nov–Dec only |
| 1956 | 24 | 647,956 | Complete coverage |
| 1957 | 24 | 669,896 | Complete coverage |
| 1958 | 21 | 617,044 | Complete coverage; monthly publication from November 1958 onward |
| 1959 | 12 | 431,252 | Single monthly issues |
| 1960 | 12 | 386,075 | Single monthly issues |
| 1961 | 1 | 32,315 | Jan only |
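The yearly slices differ widely in size, which matters when per-year embeddings are compared with one another. The short sketch below only re-derives totals and per-issue averages from the figures in the table above; the variable names are illustrative and not part of the dataset.

```python
# Issue and character counts copied from the Corpus Coverage table.
corpus = {
    1955: (4, 93_773),
    1956: (24, 647_956),
    1957: (24, 669_896),
    1958: (21, 617_044),
    1959: (12, 431_252),
    1960: (12, 386_075),
    1961: (1, 32_315),
}

total_issues = sum(issues for issues, _ in corpus.values())
total_chars = sum(chars for _, chars in corpus.values())

# Average issue length per year, and each year's share of the whole corpus.
chars_per_issue = {year: chars / issues for year, (issues, chars) in corpus.items()}
share = {year: chars / total_chars for year, (_, chars) in corpus.items()}

for year in corpus:
    print(f"{year}: {chars_per_issue[year]:>9,.0f} chars/issue, {share[year]:6.1%} of corpus")
print(f"Total: {total_issues} issues, {total_chars:,} characters")
```

The largest slice (1957) is roughly twenty times the size of the smallest (1961), so embeddings for the partial years 1955 and 1961 rest on far less text than those for the complete years.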
Available Models
Explore the Dataset
Featured: Embedding Visualizations
Word embeddings are visualized using three dimensionality reduction methods: PCA (Principal Component Analysis), t-SNE (t-distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection).
Visualizations are available for Word2Vec, FastText, and BERT models in both 2D and 3D formats.
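As an illustration of the simplest of the three methods, the following is a minimal PCA sketch using NumPy. It is not the project's own visualization code: the random matrix is a stand-in for real word vectors, and `pca_project` is a hypothetical helper name. t-SNE and UMAP are non-linear and would normally come from libraries such as scikit-learn or umap-learn rather than being written by hand.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Toy stand-in for an embedding matrix: 50 "words" with 100 dimensions each.
# Real vectors would come from one of the per-year models instead.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))

coords_2d = pca_project(toy_vectors, n_components=2)  # one (x, y) row per word
coords_3d = pca_project(toy_vectors, n_components=3)  # one (x, y, z) row per word
```

Each row of `coords_2d` or `coords_3d` can then be passed to any plotting library as a point labelled with its word.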
Research Applications
- Diachronic Analysis: Track semantic shifts across the 1955-1961 period using word embeddings
- Thematic Evolution: Compare BERT contextual embeddings across years for concept analysis
- Cross-temporal Comparison: Use consistent preprocessing for reliable year-to-year comparisons
- Keyword Networks: Utilize pre-computed embeddings for semantic similarity networks
- Topic Modeling: Apply to the chronologically organized corpus for temporal topic analysis
- Stylometric Analysis: Compare linguistic features across different years and authors
- Post-colonial Literary History: Examine evolving themes in Malayan Chinese literature
- Cultural Identity: Analyze conceptual frameworks beyond traditional national/diasporic categories
- Comparative Literature: Position within broader Southeast Asian Chinese literary traditions
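One way to approach the diachronic and keyword-network uses listed above is to compare a word's nearest neighbours in two yearly models. Assuming the yearly models were trained independently, their coordinate axes are not directly comparable, so the sketch below compares neighbour sets rather than raw vectors. The tokens and vectors are invented placeholders, not values from the dataset, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbours(word, vectors, k):
    """Return the k words most similar to `word` within one embedding space."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}

def neighbour_overlap(word, vectors_a, vectors_b, k=2):
    """Jaccard overlap of a word's top-k neighbours in two spaces.

    Only the vocabulary shared by both spaces is considered.
    1.0 means identical neighbourhoods; 0.0 means no neighbours in common.
    """
    shared = set(vectors_a) & set(vectors_b)
    a = {w: vectors_a[w] for w in shared}
    b = {w: vectors_b[w] for w in shared}
    near_a = nearest_neighbours(word, a, k)
    near_b = nearest_neighbours(word, b, k)
    return len(near_a & near_b) / len(near_a | near_b)

# Invented two-dimensional toy spaces standing in for two yearly models.
year_a = {"target": [1.0, 0.0], "w1": [0.9, 0.1], "w2": [0.8, 0.3],
          "w3": [0.0, 1.0], "w4": [-1.0, 0.2]}
year_b = {"target": [0.0, 1.0], "w1": [1.0, 0.0], "w2": [0.7, 0.7],
          "w3": [0.1, 1.0], "w4": [-0.2, 0.9]}

overlap = neighbour_overlap("target", year_a, year_b, k=2)
```

A low overlap for a keyword between two years is a prompt for close reading, not evidence of semantic change on its own, especially for the small 1955 and 1961 slices. The same `cosine` scores, thresholded, could also serve as edge weights in a keyword similarity network.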
Citation
If you use this dataset in your research, please cite:
Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin.
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)."
[Dataset]. University of Hong Kong, 2026.
DOI: 10.5281/zenodo.18205257.
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.
Project Team
Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong
Email: nyhwong@hku.hk
Website: nyhwong.com | ORCID: 0000-0003-3953-5179
Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong
Role: Visualization generation and dataset creation
Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong
Role: Text vetting and OCR validation
Acknowledgments
Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"
Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.
License
This dataset is provided for academic and research purposes. Commercial use or redistribution requires permission from the project author.