About This Dataset
Project Overview
The Mahua Word Embeddings dataset is a Digital Humanities resource containing word embeddings trained on Traditional Chinese texts from Mahua (馬華) literature — literary works published in Malaya from 1955 to 1961.
The corpus includes issues from the journal "蕉風" (Jiāo Fēng), representing a significant body of Chinese diaspora literature during the formative years of the Malayan Chinese community.
This dataset enables researchers to study the semantic evolution of Chinese diaspora literature, analyze the vocabulary of the Malayan Chinese community during the mid-20th century, and explore how concepts related to humanism and rationality were expressed in this unique literary tradition.
Project Aim
The project seeks to move beyond traditional frameworks—Malayan (national assimilation) and Chinese (diasporic or linguistic)—in analyzing the Mahua literary journal. Instead, it explores conceptual and thematic connections through digital humanities methods, engaging with broader literary and cultural debates.
A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project uses BERT together with other embedding models (word2vec and fastText) to capture contextual meanings and semantic nuances across the corpus, with the goal of tracing how ideas and themes change over time.
Dataset Structure
This dataset is organized into three main components:
- corpus/: Contains the original text files organized by year (1955-1961). Each file represents a specific half-month issue (1955–1958) or a single monthly issue (November 1958 onward) and follows the YYYY-MM-first|second-issue-XXX.txt naming convention.
- rationality-related/: Contains specialized folders for analysis of rationality-related content from specific issues (1959_04_1_jf78 and 1959_05_1_jf79), retained for historical traceability.
- yearly-based-model-data/: Contains processed model data organized by year for embedding and analysis purposes.
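As a rough illustration of the file naming convention, the sketch below parses a file name into year, month, half, and issue number. The regular expression is only one reading of the YYYY-MM-first|second-issue-XXX.txt pattern: treating the first/second marker as optional (so that monthly issues could also match) is an assumption, and the file names in the usage lines are hypothetical rather than taken from the dataset.

```python
import re
from typing import NamedTuple, Optional

# Assumed interpretation of "YYYY-MM-first|second-issue-XXX.txt".
# The half marker is optional here so monthly issues could match too;
# the dataset documentation does not state how monthly files are named.
ISSUE_NAME = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-"
    r"(?:(?P<half>first|second)-)?issue-(?P<number>\d+)\.txt$"
)


class IssueFile(NamedTuple):
    year: int
    month: int
    half: Optional[str]  # "first", "second", or None when no marker is present
    number: int


def parse_issue_name(name: str) -> Optional[IssueFile]:
    """Return the parsed parts of a file name, or None if it does not match."""
    m = ISSUE_NAME.match(name)
    if m is None:
        return None
    return IssueFile(int(m["year"]), int(m["month"]), m["half"], int(m["number"]))


# Hypothetical file names, for illustration only.
print(parse_issue_name("1956-03-first-issue-009.txt"))
print(parse_issue_name("notes.txt"))
```

Names that do not match return None instead of raising, which makes it easy to list any files in corpus/ that deviate from the assumed pattern.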
Corpus Coverage
| Year | Issues | Character Count | Notes |
|---|---|---|---|
| 1955 | 4 | 93,773 | Nov–Dec only |
| 1956 | 24 | 647,956 | Complete coverage |
| 1957 | 24 | 669,896 | Complete coverage |
| 1958 | 21 | 617,044 | Complete coverage; monthly publication from November 1958 onward |
| 1959 | 12 | 431,252 | Single monthly issues |
| 1960 | 12 | 386,075 | Single monthly issues |
| 1961 | 1 | 32,315 | Jan only |
Total: 98 issues • 2,878,311 characters
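The totals can be re-derived from the per-year rows. A minimal Python check, using only the figures in the table above:

```python
# (issues, characters) per year, copied from the Corpus Coverage table.
coverage = {
    1955: (4, 93_773),
    1956: (24, 647_956),
    1957: (24, 669_896),
    1958: (21, 617_044),
    1959: (12, 431_252),
    1960: (12, 386_075),
    1961: (1, 32_315),
}

total_issues = sum(issues for issues, _ in coverage.values())
total_chars = sum(chars for _, chars in coverage.values())

print(total_issues, total_chars)  # 98 2878311
```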
Methodology
Data Formats
Original literary journal texts in the corpus/ folder.
- Encoding: UTF-8
- Format: Plain text, one file per issue
- Language: Traditional Chinese characters
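Because the corpus files are plain UTF-8 text, they can be read with the Python standard library alone. The sketch below walks a local copy of the corpus/ folder and counts characters per file. Skipping whitespace is an assumption made for this example; the counting rule behind the Character Count column is not documented here, so the results may differ from the table.

```python
from pathlib import Path
from typing import Dict


def count_characters(corpus_dir: str) -> Dict[str, int]:
    """Map each .txt file (path relative to corpus_dir) to its character count.

    Whitespace is skipped; this is an assumed counting rule, not
    necessarily the one used for the published figures.
    """
    root = Path(corpus_dir)
    counts: Dict[str, int] = {}
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        key = path.relative_to(root).as_posix()
        counts[key] = sum(1 for ch in text if not ch.isspace())
    return counts
```

Calling count_characters("corpus") on a local copy returns one entry per issue file, which can then be summed by year for comparison with the coverage table.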
Processed data files in rationality-related/ and yearly-based-model-data/ folders.
Schema fields:
- year: Publication year (1955-1961)
- processed_tokens: List of segmented tokens
- token_frequency: Top 50 most frequent tokens
- sentence_count: Number of sentences
- total_tokens: Total token count
- unique_vocabulary_count: Vocabulary size
- embeddings: Model-specific vector representations
- model_type: "bert", "word2vec", or "fasttext"
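A record with these fields can be sanity-checked before analysis. The sketch below assumes each record has already been loaded into a Python dict (for example from JSON); the on-disk format and the exact value types are assumptions, not something this description specifies. The year check uses the 1955 to 1961 range given in the Corpus Coverage table.

```python
from typing import Any, Dict, List

# Field names come from the schema list; the Python types are assumptions.
EXPECTED_TYPES = {
    "year": int,
    "processed_tokens": list,
    "token_frequency": (list, dict),  # container type not documented
    "sentence_count": int,
    "total_tokens": int,
    "unique_vocabulary_count": int,
    "embeddings": object,             # model-specific, so any type is accepted
    "model_type": str,
}
ALLOWED_MODELS = ("bert", "word2vec", "fasttext")


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"unexpected type for {field}")
    if "model_type" in record and record["model_type"] not in ALLOWED_MODELS:
        problems.append("model_type is not one of bert, word2vec, fasttext")
    year = record.get("year")
    if isinstance(year, int) and not 1955 <= year <= 1961:
        problems.append("year outside corpus coverage (1955-1961)")
    return problems
```

An empty list means the record has every listed field with a plausible type. Any messages returned point to fields worth inspecting by hand.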
How to Reuse This Dataset
Data Quality and Limitations
Citation
If you use this dataset in your research, please cite:
Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin.
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)."
[Dataset]. University of Hong Kong, 2026.
DOI: 10.5281/zenodo.18205257.
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.
For APA style:
Wong, N. Y. H., Ye Tsz Yu, C., & Xiang Haiyin, A. (2026).
DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)
[Dataset]. University of Hong Kong. https://doi.org/10.5281/zenodo.18205257.
Funded by Hong Kong RGC ECS Grant No. 27609122.
Note: Each version has its own DOI. Please cite the specific version you used: v1.0.0, v1.1.0, v1.2.0, or v1.2.1.
Project Team
Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong
Email: nyhwong@hku.hk
Website: nyhwong.com
ORCID: 0000-0003-3953-5179
Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong
Email: u3607570@connect.hku.hk
Role: Visualization generation and dataset creation
Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong
Role: Text vetting and OCR validation
Acknowledgments
Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"
Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.
License and Permissions
This dataset is provided for academic and research purposes.
Commercial use or redistribution requires permission from the project author.
Please contact Nicholas Y. H. Wong for licensing inquiries.
Contact
For questions about the dataset:
Nicholas Y. H. Wong
Email: nyhwong@hku.hk
GitHub Repository:
github.com/candyyetszyu/DH_Dataset_Mahua_word_embeddings
Please open an issue on GitHub for dataset-related questions or submit a pull request for corrections.
Version History
| Version | Date | DOI | Description |
|---|---|---|---|
| v1.2.1 | January 10, 2026 | 10.5281/zenodo.18205257 | Project Website Update |
| v1.2.0 | January 10, 2026 | 10.5281/zenodo.18205166 | Documentation Updates |
| v1.1.0 | November 17, 2025 | 10.5281/zenodo.17627000 | Word2vec Network Analysis for 1959_04_1_jf78 |
| v1.0.0 | November 17, 2025 | 10.5281/zenodo.17626427 | DH Mahua Literary Journal Dataset - Initial release with yearly model data |
All versions are archived on Zenodo. Please cite the specific version used in your research.