Article: 2016, Vol. 34, Issue 4: 698-702
Cite this article:

Guo Lantian, Li Yang, Mu Dejun, Yang Tao, Li Zhe. A LDA Model Based Topic Detection Method[J]. Journal of Northwestern Polytechnical University, 2016, 34(4): 698-702

A LDA Model Based Topic Detection Method

Guo Lantian, Li Yang, Mu Dejun, Yang Tao, Li Zhe

School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
Abstract:
Topic detection is one of the key techniques for extracting hot topics and tracking their evolution. Detecting topics in the large volume of short texts on social networks is difficult for two reasons: the texts are high-dimensional, which makes them hard for topic models to process, and topics are unevenly distributed, which makes the detected topics unclear. To address these problems, we propose CBOW-LDA, a topic modeling method based on the LDA (Latent Dirichlet Allocation) topic model. It uses word vectors produced by the CBOW (Continuous Bag-of-Words) model to cluster similar words in the target corpus, which reduces the dimensionality of the text fed into the LDA model and makes the topics more distinct. In experiments on a real-world dataset, compared with an existing LDA method that uses word-frequency-weighted word vectorization, our method lowers perplexity by about 3% for the same number of topic words.
Key words:
word vectors
LDA model
topic detection
perplexity
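
The two ideas in the abstract, merging similar words to shrink the LDA input vocabulary and scoring a model by perplexity, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the clustering algorithm, the similarity measure, or any threshold, so the greedy cosine-similarity scheme, the `threshold` value, the toy 2-dimensional vectors, and all function names here are assumptions. In practice the vectors would come from a trained CBOW model. The `perplexity` helper uses the standard definition (the exponential of the negative mean per-token log-likelihood), which may differ in detail from the paper's evaluation setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_words(vectors, threshold=0.95):
    """Greedy clustering (an assumed scheme): map each word to the first
    existing representative whose vector is at least `threshold` similar;
    otherwise the word becomes a new representative."""
    representatives = []
    mapping = {}
    for word, vec in vectors.items():
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                mapping[word] = rep
                break
        else:
            representatives.append(word)
            mapping[word] = word
    return mapping


def reduce_corpus(docs, mapping):
    """Replace every token by its cluster representative."""
    return [[mapping.get(tok, tok) for tok in doc] for doc in docs]


def perplexity(token_probs):
    """Standard perplexity: exp of the negative mean log-probability
    that the model assigns to each held-out token."""
    total_log = sum(math.log(p) for p in token_probs)
    return math.exp(-total_log / len(token_probs))


# Hypothetical word vectors standing in for CBOW output.
TOY_VECTORS = {
    "phone": [0.9, 0.1],
    "mobile": [0.88, 0.15],
    "match": [0.1, 0.95],
    "game": [0.15, 0.9],
    "weather": [0.7, 0.7],
}
docs = [["phone", "mobile", "game"], ["match", "game", "weather"]]

mapping = cluster_words(TOY_VECTORS, threshold=0.95)
reduced = reduce_corpus(docs, mapping)
vocab_before = {tok for doc in docs for tok in doc}
vocab_after = {tok for doc in reduced for tok in doc}

print(len(vocab_before), "->", len(vocab_after))  # vocabulary size before and after merging
print(reduced)
```

The reduced documents would then be passed to an LDA implementation in place of the raw tokens. Lower perplexity on held-out text means the model assigns higher probability to the words it actually sees, which is the sense in which the abstract reports an improvement of about 3%.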
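
Received: 2016-03-19
Funding: Supported by the National Natural Science Foundation of China (61402373, 61303224, 61403311) and the Aeronautical Science Foundation of China (20155553036, 2013ZC53034)
About the author: Guo Lantian (born 1987), PhD candidate at Northwestern Polytechnical University; research interests include data mining and machine learning.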