About This Dataset

Project Overview

The Mahua Word Embeddings dataset is a Digital Humanities resource containing word embeddings trained on Traditional Chinese texts from Mahua (馬華) literature — literary works published in Malaya from 1955 to 1961.

The corpus includes issues from the journal "蕉風" (Jiāo Fēng), representing a significant body of Chinese diaspora literature during the formative years of the Malayan Chinese community.

This dataset enables researchers to study the semantic evolution of Chinese diaspora literature, analyze the vocabulary of the Malayan Chinese community during the mid-20th century, and explore how concepts related to humanism and rationality were expressed in this unique literary tradition.

Project Aim

The project seeks to move beyond traditional frameworks—Malayan (national assimilation) and Chinese (diasporic or linguistic)—in analyzing the Mahua literary journal. Instead, it explores conceptual and thematic connections through digital humanities methods, engaging with broader literary and cultural debates.

A key research focus is the computational analysis of important concepts and keywords as they appear in various forms throughout the journal. The project leverages BERT and other word embedding models to better capture contextual meanings and semantic nuances across the literary corpus, enabling new insights into the evolution of ideas and themes over time.

Dataset Structure

This dataset is organized into three main components:

corpus/

Contains the original text files, organized by year (1955–1961). Each file holds one issue: a semi-monthly issue (1955 to October 1958) or a single monthly issue (November 1958 onward). File names follow the YYYY-MM-first|second-issue-XXX.txt convention.

rationality-related/

Contains specialized folders for analysis of rationality-related content from specific issues (1959_04_1_jf78 and 1959_05_1_jf79), retained for historical traceability.

yearly-based-model-data/

Contains processed model data organized by year for embedding and analysis purposes.
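
As a rough illustration of the corpus/ naming convention described above, the sketch below splits a file name into year, month, half-month marker, and issue number. It assumes that "first" or "second" marks the half-month, that "issue" is a literal word, and that XXX is a zero-padded issue number. The pattern, the IssueInfo type, and the sample file name are hypothetical and should be checked against the actual files.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of YYYY-MM-first|second-issue-XXX.txt; adjust if the
# real file names differ (for example, for the monthly issues).
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<half>first|second)"
    r"-issue-(?P<issue>\d+)\.txt$"
)


class IssueInfo(NamedTuple):
    year: int
    month: int
    half: str   # "first" or "second"
    issue: int  # running issue number (XXX)


def parse_issue_filename(name: str) -> Optional[IssueInfo]:
    """Return the parsed parts, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return IssueInfo(int(m["year"]), int(m["month"]), m["half"], int(m["issue"]))


# Hypothetical file name, for illustration only.
print(parse_issue_filename("1956-03-first-issue-009.txt"))
```

Names that do not match return None rather than raising, so unexpected files can be listed and inspected instead of stopping a batch run.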

Corpus Coverage

Year   Issues   Character Count   Notes
1955   4        93,773            Nov–Dec only
1956   24       647,956           Complete coverage
1957   24       669,896           Complete coverage
1958   21       617,044           Complete coverage; monthly publication from November 1958 onward
1959   12       431,252           Single monthly issues
1960   12       386,075           Single monthly issues
1961   1        32,315            Jan only

Total: 98 issues • 2,878,311 characters
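
The totals can be cross-checked against the per-year figures. The short sketch below copies the issue and character counts from the coverage table by hand (the dictionary is not part of the dataset) and sums them.

```python
# Per-year (issues, characters), copied from the coverage table above.
coverage = {
    1955: (4, 93_773),
    1956: (24, 647_956),
    1957: (24, 669_896),
    1958: (21, 617_044),
    1959: (12, 431_252),
    1960: (12, 386_075),
    1961: (1, 32_315),
}

total_issues = sum(issues for issues, _ in coverage.values())
total_chars = sum(chars for _, chars in coverage.values())

print(total_issues, f"{total_chars:,}")  # 98 2,878,311
```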

Methodology

Data Formats

TXT Files (Corpus)

Original literary journal texts in the corpus/ folder.

  • Encoding: UTF-8
  • Format: Plain text, one file per issue
  • Language: Traditional Chinese characters

JSON Files (Model Data)

Processed data files in rationality-related/ and yearly-based-model-data/ folders.

Schema fields:

  • year: Publication year (1955-1961)
  • processed_tokens: List of segmented tokens
  • token_frequency: Top 50 most frequent tokens
  • sentence_count: Number of sentences
  • total_tokens: Total token count
  • unique_vocabulary_count: Vocabulary size
  • embeddings: Model-specific vector representations
  • model_type: "bert", "word2vec", or "fasttext"
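
A minimal sketch of loading one model-data file and checking it against the fields listed above. It assumes each JSON file holds a single object with these keys at the top level. The helper names load_record and check_record are hypothetical, and nothing is assumed about the inner shape of token_frequency or embeddings, which may differ by model type.

```python
import json
from pathlib import Path
from typing import Any

# Field names taken from the schema list above.
EXPECTED_FIELDS = {
    "year", "processed_tokens", "token_frequency", "sentence_count",
    "total_tokens", "unique_vocabulary_count", "embeddings", "model_type",
}
KNOWN_MODEL_TYPES = {"bert", "word2vec", "fasttext"}


def load_record(path: str) -> dict[str, Any]:
    """Read one JSON file as UTF-8 (assumes a single top-level object)."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none."""
    problems = []
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    model_type = record.get("model_type")
    if model_type is not None and model_type not in KNOWN_MODEL_TYPES:
        problems.append(f"unexpected model_type: {model_type!r}")
    return problems
```

Running check_record(load_record(path)) on each file before analysis gives an early warning if a file departs from the documented schema.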

How to Reuse This Dataset

Data Quality and Limitations

Citation

If you use this dataset in your research, please cite:

Wong, Nicholas Y. H., Candy Ye Tsz Yu, and Allie Xiang Haiyin. 
"DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961)." 
[Dataset]. University of Hong Kong, 2026. 
DOI: 10.5281/zenodo.18205257. 
Funded by Hong Kong Research Grants Council ECS Grant No. 27609122.

For APA style:

Wong, N. Y. H., Ye Tsz Yu, C., & Xiang Haiyin, A. (2026). 
DH Mahua Literary Journal Dataset: Word Embeddings for Malayan Chinese Literature (1955-1961) 
[Dataset]. University of Hong Kong. https://doi.org/10.5281/zenodo.18205257. 
Funded by Hong Kong RGC ECS Grant No. 27609122.

Note: Each version has its own DOI. Please cite the specific version you used: v1.0.0, v1.1.0, v1.2.0, or v1.2.1.

Project Team

Project Author

Nicholas Y. H. Wong
Assistant Professor, School of Chinese
The University of Hong Kong

Email: nyhwong@hku.hk
Website: nyhwong.com
ORCID: 0000-0003-3953-5179

Contributors

Ye Tsz Yu Candy
Bachelor of Arts, Majors: Chinese Language and Literature, Computer Science
The University of Hong Kong

Email: u3607570@connect.hku.hk
Role: Visualization generation and dataset creation

Text Vetting

Allie Xiang Haiyin
Bachelor of Arts, Majors: Translation and Comparative Literature, Minor: Art History
The University of Hong Kong

Role: Text vetting and OCR validation

Acknowledgments

Funding: Research Grants Council (RGC) of Hong Kong SAR, China - Early Career Scheme (ECS) Grant No. 27609122 for "Visualizing Keywords in Malaysian-Chinese Literary History via Digital Humanities Methods"

Special Collection: This dataset supports the Journal of Open Humanities Data (JOHD) Special Collection on "Benchmarking in Digital Humanities" edited by Dr. Jenny C.Y. Kwok and Dr. Liam Jianliang Gao.

License and Permissions

This dataset is provided for academic and research purposes.

Commercial use or redistribution requires permission from the project author.

Please contact Nicholas Y. H. Wong for licensing inquiries.

Contact

For questions about the dataset:

Nicholas Y. H. Wong
Email: nyhwong@hku.hk

GitHub Repository:
github.com/candyyetszyu/DH_Dataset_Mahua_word_embeddings

Please open an issue on GitHub for dataset-related questions or submit a pull request for corrections.

Version History

Version   Date                 DOI                        Description
v1.2.1    January 10, 2026     10.5281/zenodo.18205257    Project Website Update
v1.2.0    January 10, 2026     10.5281/zenodo.18205166    Documentation Updates
v1.1.0    November 17, 2025    10.5281/zenodo.17627000    Word2vec Network Analysis for 1959_04_1_jf78
v1.0.0    November 17, 2025    10.5281/zenodo.17626427    DH Mahua Literary Journal Dataset - Initial release with yearly model data

All versions are archived on Zenodo. Please cite the specific version used in your research.