Article: 2017, Vol. 35, Issue 4: 729-735
Cite this article:
Dong Yangyi, Li Weihua, Yu Hui. Domain Term Extraction Method Based on Hierarchical Combination Strategy for Chinese Web Documents[J]. Journal of Northwestern Polytechnical University (in Chinese)

Domain Term Extraction Method Based on Hierarchical Combination Strategy for Chinese Web Documents
(Chinese title: 文本特征和复合统计量的领域术语抽取方法, literally "A Domain Term Extraction Method Based on Text Features and Composite Statistics")
Dong Yangyi, Li Weihua, Yu Hui
School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China
Abstract:
Extracting Chinese domain terms is an important part of text knowledge mining. Traditional Chinese domain term extraction relies mainly on manual work, which is time-consuming and laborious. The automatic extraction methods currently under study fall into three groups: dictionary-based, rule-based, and statistics-based methods. Because of the complexity of Chinese natural language, each of these has limitations, such as slow updating of domain-specific user dictionaries and rules and insufficient consideration of text features, which lead to poor extraction performance. To address this problem, this paper proposes a Chinese domain term extraction method based on text features and composite statistics. After a coarse-grained screening of the words in a Chinese document, the method takes into account text features of each candidate term, including part of speech, length, and boundary words, constructs statistics such as information entropy and TF-IDF, and computes a comprehensive weight. Candidate terms whose comprehensive weight exceeds a set threshold are extracted as the final domain terms. Experimental results show that the method achieves good precision, recall, and F-measure on the test corpus.
Key words:    Chinese domain term    text mining    natural language processing    text feature
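The abstract names the stages of the method but gives no formulas, so the following is only a minimal Python sketch of such a pipeline under stated assumptions, not the authors' implementation. It assumes documents are already segmented and part-of-speech tagged as lists of `(word, tag)` pairs; the tag set `ALLOWED_POS`, the length limit `MAX_LEN`, the smoothed IDF over a separate background corpus, the linear combination with `alpha`, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

ALLOWED_POS = {"n", "v", "a"}  # hypothetical tag set for the coarse filter
MAX_LEN = 3                    # hypothetical maximum term length, in words


def ngram_positions(doc, n):
    """Yield (start index, word tuple, tag tuple) for every n-gram in doc."""
    for i in range(len(doc) - n + 1):
        window = doc[i:i + n]
        yield i, tuple(w for w, _ in window), tuple(t for _, t in window)


def candidates(docs):
    """Coarse filter: allowed tags only, and the last word must be a noun."""
    found = set()
    for doc in docs:
        for n in range(1, MAX_LEN + 1):
            for _, words, tags in ngram_positions(doc, n):
                if set(tags) <= ALLOWED_POS and tags[-1] == "n":
                    found.add(words)
    return found


def entropy(counter):
    """Shannon entropy (bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def boundary_entropy(docs, term):
    """Smaller of the left-neighbour and right-neighbour entropies of term."""
    left, right = Counter(), Counter()
    n = len(term)
    for doc in docs:
        for i, words, _ in ngram_positions(doc, n):
            if words == term:
                left[doc[i - 1][0] if i > 0 else "<s>"] += 1
                right[doc[i + n][0] if i + n < len(doc) else "</s>"] += 1
    return min(entropy(left), entropy(right))


def count(doc, term):
    """Number of occurrences of term (a word tuple) in one document."""
    return sum(1 for _, words, _ in ngram_positions(doc, len(term)) if words == term)


def tf_idf(domain_docs, background_docs, term):
    """Domain frequency times a smoothed IDF over the background corpus."""
    tf = sum(count(d, term) for d in domain_docs)
    df = sum(1 for d in background_docs if count(d, term) > 0)
    return tf * math.log((1 + len(background_docs)) / (1 + df))


def extract_terms(domain_docs, background_docs, alpha=0.5, threshold=1.0):
    """Keep candidates whose combined weight exceeds the threshold."""
    scores = {}
    for term in candidates(domain_docs):
        weight = (alpha * tf_idf(domain_docs, background_docs, term)
                  + (1 - alpha) * boundary_entropy(domain_docs, term))
        if weight > threshold:
            scores[term] = weight
    return scores


# Toy, hand-tagged data (English placeholders standing in for segmented Chinese).
domain_docs = [
    [("we", "r"), ("study", "v"), ("domain", "n"), ("term", "n"), ("today", "t")],
    [("the", "u"), ("domain", "n"), ("term", "n"), ("list", "n")],
    [("a", "u"), ("good", "a"), ("domain", "n"), ("term", "n"), ("helps", "v")],
]
background_docs = [
    [("the", "u"), ("list", "n"), ("helps", "v")],
    [("we", "r"), ("study", "v"), ("today", "t")],
]
result = extract_terms(domain_docs, background_docs, alpha=0.5, threshold=2.0)
```

On this toy input only the two-word candidate `("domain", "term")` passes the threshold of 2.0. The single words `domain` and `term` are equally frequent, but each always has the same neighbour on one side, so its boundary entropy is zero and its combined weight stays lower. This is the kind of effect a boundary-word feature is meant to capture, although the paper's actual statistics and weighting may differ.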
Received: 2016-09-25
Funding: Supported by the Natural Science Foundation of Shaanxi Province (2015JM6290)
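About the author: Dong Yangyi (born 1978), female, PhD candidate at Northwestern Polytechnical University; her research focuses on text data mining and intelligent decision-making.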
