Mahua Word Embeddings

Digital Humanities Dataset of Chinese Word Embeddings from Malayan Chinese Literature (1955-1961)

7 Years
98 Issues
3 Models
2.8M+ Characters

About This Dataset

This dataset contains word embeddings trained on texts from Mahua (馬華) literature — literary works published in Malaya from 1955 to 1961. The corpus includes issues from the journal "蕉風" (Jiāo Fēng), representing a significant body of Chinese diaspora literature.

The project seeks to move beyond the two traditional frameworks for analyzing the Mahua literary journal: Malayan (national assimilation) and Chinese (diasporic or linguistic). Instead, it uses digital humanities methods to explore conceptual and thematic connections, engaging with broader literary and cultural debates.

The dataset provides multiple embedding models (Word2Vec, FastText, and BERT) for each year, enabling researchers to study semantic changes in the Mahua literary vocabulary over time.
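One simple way to compare a word across yearly models is to compare its nearest neighbors rather than its raw vectors, since vectors from independently trained models do not share a coordinate system. The sketch below is only an illustration under stated assumptions: it assumes each yearly model can be loaded as a vocabulary list plus a NumPy matrix, and the tokens and 2-dimensional vectors are toy placeholders, not values from this dataset. The function names are hypothetical.

```python
import numpy as np

def nearest_neighbors(word, vocab, vectors, k=3):
    """Return the k words with the highest cosine similarity to `word`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:k]

def neighbor_overlap(word, year_a, year_b, k=3):
    """Jaccard overlap of a word's k nearest neighbors in two yearly models."""
    a = set(nearest_neighbors(word, *year_a, k=k))
    b = set(nearest_neighbors(word, *year_b, k=k))
    return len(a & b) / len(a | b)

# Toy placeholders (assumption): the real models have 100 or 768 dimensions.
vocab = ["target", "a", "b", "c", "d"]
model_1956 = (vocab, np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [-1.0, 0.0]]))
model_1960 = (vocab, np.array([[0.0, 1.0], [0.9, 0.1], [0.3, 0.8], [0.1, 1.0], [-1.0, 0.0]]))

print(nearest_neighbors("target", *model_1956, k=2))
print(nearest_neighbors("target", *model_1960, k=2))
print(neighbor_overlap("target", model_1956, model_1960, k=2))
```

A low overlap suggests that the word's neighborhood has changed between the two years. With small yearly corpora, some of that change may reflect training noise rather than semantic shift.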

Project Aim

A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project leverages BERT and other word embedding models to better capture contextual meanings and semantic nuances across the literary corpus, enabling new insights into the evolution of ideas and themes over time.

Corpus Coverage

Year   Issues   Characters   Notes
1955   4        93,773       Nov–Dec only
1956   24       647,956      Complete coverage
1957   24       669,896      Complete coverage
1958   21       617,044      Complete coverage; monthly publication from November 1958 onward
1959   12       431,252      Single monthly issues
1960   12       386,075      Single monthly issues
1961   1        32,315       Jan only
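As a quick consistency check, the yearly figures in the Corpus Coverage table can be summed and compared with the headline totals (98 issues, 2.8M+ characters). The sketch below copies the numbers from the table; the variable names are illustrative only.

```python
# Figures copied from the Corpus Coverage table: (year, issues, characters).
COVERAGE = [
    (1955, 4, 93_773),
    (1956, 24, 647_956),
    (1957, 24, 669_896),
    (1958, 21, 617_044),
    (1959, 12, 431_252),
    (1960, 12, 386_075),
    (1961, 1, 32_315),
]

total_issues = sum(issues for _, issues, _ in COVERAGE)
total_chars = sum(chars for _, _, chars in COVERAGE)

# Average characters per issue for each year, rounded to the nearest integer.
chars_per_issue = {year: round(chars / issues) for year, issues, chars in COVERAGE}

print(f"{total_issues} issues, {total_chars:,} characters")
for year, avg in chars_per_issue.items():
    print(year, f"{avg:,} characters per issue")
```

The per-issue averages also show how uneven the yearly corpora are in size, which matters when comparing models trained on 1955 or 1961 with those trained on full years.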

Available Models

Word2Vec: CBOW, 100 dimensions
FastText: Subword, 100 dimensions
BERT: Chinese-BERT, 768 dimensions

Explore the Dataset

Featured: Embedding Visualizations

2D and 3D Dimensionality Reduction

Word embeddings are visualized using three dimensionality reduction methods: PCA (Principal Component Analysis), t-SNE (t-distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection).

Visualizations are available for Word2Vec, FastText, and BERT models in both 2D and 3D formats.
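For readers who want to reproduce a projection of this kind, PCA can be sketched with NumPy alone. This is a minimal illustration, not the project's visualization code: the random matrix stands in for a real 100-dimensional embedding matrix, and `pca_project` is a hypothetical helper name. t-SNE and UMAP need additional libraries (for example scikit-learn and umap-learn) and are not shown.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in data (assumption): 50 words x 100 dimensions of random values.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))

coords_2d = pca_project(toy_vectors, n_components=2)
coords_3d = pca_project(toy_vectors, n_components=3)
print(coords_2d.shape, coords_3d.shape)
```

Each row of `coords_2d` or `coords_3d` gives the plotting coordinates of one word. The first two columns of the 3D projection match the 2D projection, because both use the same leading components.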

Research Applications

Computational Literary Studies
  • Diachronic Analysis: Track semantic shifts across the 1955-1961 period using word embeddings
  • Thematic Evolution: Compare BERT contextual embeddings across years for concept analysis
  • Cross-temporal Comparison: Use consistent preprocessing for reliable year-to-year comparisons
Digital Humanities Methods
  • Keyword Networks: Utilize pre-computed embeddings for semantic similarity networks
  • Topic Modeling: Apply to chronologically organized corpus for temporal topic analysis
  • Stylometric Analysis: Compare linguistic features across different years and authors
Historical and Cultural Studies
  • Post-colonial Literary History: Examine evolving themes in Malayan Chinese literature
  • Cultural Identity: Analyze conceptual frameworks beyond traditional national/diasporic categories
  • Comparative Literature: Position within broader Southeast Asian Chinese literary traditions
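
As an illustration of the keyword network application listed above, the sketch below builds an edge list from pairwise cosine similarities within a single model. It is only a sketch under stated assumptions: the keywords and vectors are toy placeholders, the 0.8 threshold is arbitrary, and `similarity_edges` is a hypothetical helper rather than part of the dataset's tooling.

```python
import numpy as np

def similarity_edges(vocab, vectors, threshold=0.8):
    """Return (word_a, word_b, cosine) for each pair at or above the threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    edges = []
    for i in range(len(vocab)):
        for j in range(i + 1, len(vocab)):  # each unordered pair once
            if sims[i, j] >= threshold:
                edges.append((vocab[i], vocab[j], float(sims[i, j])))
    return edges

# Toy placeholders (assumption): real keywords and vectors come from one yearly model.
keywords = ["k1", "k2", "k3", "k4"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

for a, b, score in similarity_edges(keywords, toy_vectors):
    print(a, b, round(score, 3))
```

The resulting edge list can be passed to a graph library or exported as CSV. Pairwise comparison grows quadratically with vocabulary size, so restricting the input to a curated keyword list keeps it manageable.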

Citation

If you use this dataset in your research, please cite:

Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin. 
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)." 
[Dataset]. University of Hong Kong, 2026. 
DOI: 10.5281/zenodo.18205257. 
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.

View all versions: v1.0.0, v1.1.0, v1.2.0, v1.2.1

Project Team

Project Author

Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong

Email: nyhwong@hku.hk
Website: nyhwong.com | ORCID: 0000-0003-3953-5179

Contributors

Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong

Role: Visualization generation and dataset creation

Text Vetting

Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong

Role: Text vetting and OCR validation

Acknowledgments

Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"

Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.

License

This dataset is provided for academic and research purposes. Commercial use or redistribution requires permission from the project author.