Mahua Word Embeddings

Digital Humanities Dataset of Chinese Word Embeddings from Malayan Chinese Literature (1955-1961)

7 Years

98 Issues

3 Models

2.8M+ Characters

About This Dataset

This dataset contains word embeddings trained on texts from Mahua (馬華) literature, that is, Chinese-language literary works published in Malaya from 1955 to 1961. The corpus is drawn from 98 issues of the journal "蕉風" (Jiāo Fēng), a significant body of Chinese diaspora literature.

The project seeks to move beyond traditional frameworks—Malayan (national assimilation) and Chinese (diasporic or linguistic)—in analyzing the Mahua literary journal. Instead, it explores conceptual and thematic connections through digital humanities methods, engaging with broader literary and cultural debates.

The dataset provides multiple embedding models (Word2Vec, FastText, and BERT) for each year, enabling researchers to study semantic changes in the Mahua literary vocabulary over time.

Project Aim

A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project leverages BERT and other word embedding models to better capture contextual meanings and semantic nuances across the literary corpus, enabling new insights into the evolution of ideas and themes over time.

Corpus Coverage

Year	Issues	Characters	Notes
1955	4	93,773	Nov–Dec only
1956	24	647,956	Complete coverage
1957	24	669,896	Complete coverage
1958	21	617,044	Complete coverage; monthly publication from November 1958 onward
1959	12	431,252	Single monthly issues
1960	12	386,075	Single monthly issues
1961	1	32,315	Jan only
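
The headline figures (98 issues, 2.8M+ characters) can be checked against the table with a few lines of Python. This is a minimal sketch: the rows are copied from the table above, and the variable names are illustrative only and are not part of the dataset.

```python
# Rows copied from the Corpus Coverage table: (year, issues, characters).
COVERAGE = [
    (1955, 4, 93_773),
    (1956, 24, 647_956),
    (1957, 24, 669_896),
    (1958, 21, 617_044),
    (1959, 12, 431_252),
    (1960, 12, 386_075),
    (1961, 1, 32_315),
]

total_issues = sum(issues for _, issues, _ in COVERAGE)
total_chars = sum(chars for _, _, chars in COVERAGE)

# Average characters per issue, useful when comparing years of unequal size.
chars_per_issue = {year: round(chars / issues) for year, issues, chars in COVERAGE}

print(total_issues)  # 98
print(total_chars)   # 2878311
```

Because 1955 and 1961 are much smaller than the other years, embeddings trained on those years rest on far less text, which is worth keeping in mind in year-to-year comparisons.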

Available Models

Word2Vec CBOW • 100 dims

FastText Subword • 100 dims

BERT Chinese-BERT • 768 dims
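
To illustrate what "subword" means for the FastText model, the sketch below splits a word into character n-grams in the FastText style, with `<` and `>` marking word boundaries. The n-gram range used here (2 to 3 characters) is an assumption for illustration; this page does not state the range the project used, and `char_ngrams` is a hypothetical helper, not part of the dataset or of any library.

```python
def char_ngrams(word: str, min_n: int = 2, max_n: int = 3) -> list[str]:
    """Return FastText-style character n-grams of a word wrapped in < and >.

    The default range (2 to 3) is an assumed setting for illustration only.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


grams_short = set(char_ngrams("人道"))
grams_long = set(char_ngrams("人道主義"))

# N-grams shared by both words; shared subword units are what let a
# subword model relate a compound to its shorter root.
shared = grams_short & grams_long
print(sorted(shared))
```

Under these assumed settings, 人道 and 人道主義 share several n-grams, which is one reason a subword model can place related compounds near each other even when one of them is rare in a given year.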

Explore the Dataset

📚 By Year (1955-1961)

View word embedding statistics and model data for each year:

1955 1956 1957 1958 1959 1960 1961

🔬 Rationality Concepts Analysis

Explore semantic networks of 7 rationality-related concepts using Word2Vec:

人

Human

人文

Humanities

人文主義

Humanism

人本

Human-centric

人本主義

Anthropocentrism

人道

Humaneness

人道主義

Humanitarianism

📁 Corpus Information

Browse the original text corpus organized by year and issue:

98 text files containing Traditional Chinese texts from Mahua literature (1955-1961).

⬇️ Download Data

Download the complete dataset or individual components:

Complete Dataset (ZIP)

All files • ~20 MB

Featured: Embedding Visualizations

2D and 3D Dimensionality Reduction

Word embeddings are visualized using three dimensionality reduction methods: PCA (Principal Component Analysis), t-SNE (t-distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection).

Visualizations are available for Word2Vec, FastText, and BERT models in both 2D and 3D formats.

Research Applications

Computational Literary Studies

Diachronic Analysis: Track semantic shifts across the 1955-1961 period using word embeddings
Thematic Evolution: Compare BERT contextual embeddings across years for concept analysis
Cross-temporal Comparison: Use consistent preprocessing for reliable year-to-year comparisons
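
One simple way to approach the diachronic analysis above is to compare a word's nearest neighbours in two yearly models. Separately trained embedding spaces are not directly comparable coordinate by coordinate, so this sketch compares neighbour sets instead (Jaccard overlap). It is an illustrative technique, not necessarily the project's own method. The 2-dimensional vectors are invented toy values, not taken from the dataset, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(vectors, target, k):
    """Return the k words most similar to target within one embedding space."""
    scores = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    scores.sort(reverse=True)
    return {word for _, word in scores[:k]}


def neighbor_overlap(vectors_a, vectors_b, target, k=2):
    """Jaccard overlap of the target's neighbour sets in two spaces (1.0 = identical)."""
    a = top_k_neighbors(vectors_a, target, k)
    b = top_k_neighbors(vectors_b, target, k)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Toy vectors invented for illustration; real use would load two yearly
# models and restrict both to their shared vocabulary first.
year_a = {"人道": [1.0, 0.0], "人文": [0.9, 0.1], "人本": [0.8, 0.3], "主義": [0.0, 1.0]}
year_b = {"人道": [0.0, 1.0], "人文": [0.9, 0.1], "人本": [0.3, 0.8], "主義": [0.1, 0.9]}

overlap = neighbor_overlap(year_a, year_b, "人道", k=2)
print(round(overlap, 3))
```

A low overlap suggests that the word keeps different company in the two years. With small yearly corpora, a low score can also reflect training noise, so results for 1955 and 1961 in particular should be read cautiously.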

Digital Humanities Methods

Keyword Networks: Utilize pre-computed embeddings for semantic similarity networks
Topic Modeling: Apply to chronologically organized corpus for temporal topic analysis
Stylometric Analysis: Compare linguistic features across different years and authors
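
A keyword network of the kind mentioned above can be sketched by linking every pair of words whose cosine similarity reaches a threshold. The threshold, the toy vectors, and the function names below are assumptions made for illustration; none of them come from the dataset.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(vectors, threshold):
    """Return (word_a, word_b, score) for every pair at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges.append((a, b, round(score, 3)))
    return edges


# Toy vectors invented for illustration, not taken from the dataset.
toy = {"人": [1.0, 0.0], "人文": [0.8, 0.6], "人道": [0.0, 1.0]}

edges = similarity_edges(toy, threshold=0.5)
for a, b, score in edges:
    print(a, b, score)
```

The resulting edge list can be passed to any graph library or visualization tool. Raising the threshold yields a sparser network that keeps only the strongest associations.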

Historical and Cultural Studies

Post-colonial Literary History: Examine evolving themes in Malayan Chinese literature
Cultural Identity: Analyze conceptual frameworks beyond traditional national/diasporic categories
Comparative Literature: Position within broader Southeast Asian Chinese literary traditions

Citation

If you use this dataset in your research, please cite:

Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin. 
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)." 
[Dataset]. University of Hong Kong, 2026. 
DOI: 10.5281/zenodo.18205257. 
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.

View all versions: v1.0.0, v1.1.0, v1.2.0, v1.2.1

Project Team

Project Author

Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong

Email: nyhwong@hku.hk
Website: nyhwong.com | ORCID: 0000-0003-3953-5179

Contributors

Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong

Role: Visualization generation and dataset creation

Text Vetting

Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong

Role: Text vetting and OCR validation

Acknowledgments

Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"

Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.

License

This dataset is provided for academic and research purposes. Commercial use or redistribution requires permission from the project author.