7 December 2023 A new method for discovering domain specific words in Chinese texts based on machine learning
Yan Zhang
Proceedings Volume 12941, International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023); 1294134 (2023) https://doi.org/10.1117/12.3011472
Event: Third International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023), 2023, Yinchuan, China
Abstract
Although the Chinese of the late Qing Dynasty and early Republic of China (QDRC) differs from modern Chinese in many aspects of writing and vocabulary, the two differ little in grammar and semantics. The QDRC period is a transitional stage between ancient Chinese and modern Chinese, and we prefer to call it the pre-modern Chinese period. Drawing on methods for discovering new words and phrases in natural language processing (NLP), this paper explores a new method for discovering domain-specific words (DSW) in Chinese texts from the QDRC period. By combining corpus linguistics and computational linguistics, we construct a multi-level analytical framework that reveals the distinctive lexical characteristics of Chinese texts from this period. First, we build a small corpus of QDRC-period Chinese texts and use N-gram models (bigrams and trigrams) to preprocess the texts, including word segmentation and part-of-speech tagging. We then calculate the word frequency, pointwise mutual information, and left and right entropy of the bigram and trigram fragments, and, after manual proofreading, identify 909 key words with special status in the social, political, cultural, and other fields of the QDRC period. The experiments validate the effectiveness of the proposed method and uncover some distinctive QDRC Chinese words that have not been studied previously.
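The three statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`ngram_stats`, `select_candidates`), the character-level fragments, the probability normalization by text length, the minimum-over-splits form of pointwise mutual information for trigrams, and all thresholds are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    size = sum(neighbors.values())
    if size == 0:
        return 0.0
    return sum(-(c / size) * math.log2(c / size) for c in neighbors.values())


def ngram_stats(text, n):
    """Frequency, PMI, and left/right entropy for each character n-gram.

    Assumption: probabilities are approximated as count / len(text), and
    PMI for n > 2 is the minimum over all two-part splits of the fragment.
    """
    if n < 2:
        raise ValueError("n must be at least 2 (bigram or longer)")
    total = len(text)
    # Counts of every fragment of length 1..n; shorter ones feed the PMI.
    counts = Counter(
        text[i:i + k] for k in range(1, n + 1) for i in range(total - k + 1)
    )
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(total - n + 1):
        gram = text[i:i + n]
        if i > 0:
            left[gram][text[i - 1]] += 1   # character just before the fragment
        if i + n < total:
            right[gram][text[i + n]] += 1  # character just after the fragment

    stats = {}
    for gram, freq in counts.items():
        if len(gram) != n:
            continue
        pmi = min(
            math.log2(
                (freq / total)
                / ((counts[gram[:s]] / total) * (counts[gram[s:]] / total))
            )
            for s in range(1, n)
        )
        stats[gram] = {
            "freq": freq,
            "pmi": pmi,
            "left_entropy": entropy(left[gram]),
            "right_entropy": entropy(right[gram]),
        }
    return stats


def select_candidates(stats, min_freq, min_pmi, min_entropy):
    """Keep fragments that pass all three (hypothetical) thresholds."""
    return sorted(
        gram
        for gram, s in stats.items()
        if s["freq"] >= min_freq
        and s["pmi"] >= min_pmi
        and min(s["left_entropy"], s["right_entropy"]) >= min_entropy
    )
```

In this sketch, a fragment with high internal cohesion (PMI) and varied neighbors on both sides (left and right entropy) is treated as a candidate word; calling `ngram_stats(text, 2)` and `ngram_stats(text, 3)` would cover the bigram and trigram cases, with the resulting candidates still subject to the manual proofreading the abstract describes.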
(2023) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Yan Zhang "A new method for discovering domain specific words in Chinese texts based on machine learning", Proc. SPIE 12941, International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023), 1294134 (7 December 2023); https://doi.org/10.1117/12.3011472
KEYWORDS
Machine learning
Education and training
Standards development
Data modeling
Associative arrays
Semantics
Design and modelling
