Article: 2016, Vol. 34, Issue 4: 698-702
Cite this article:
Guo Lantian, Li Yang, Mu Dejun, Yang Tao, Li Zhe. A LDA Model Based Topic Detection Method[J]. Journal of Northwestern Polytechnical University, 2016, 34(4): 698-702

A LDA Model Based Topic Detection Method
Guo Lantian, Li Yang, Mu Dejun, Yang Tao, Li Zhe
School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
Abstract:
Topic detection is a key technique for extracting hot topics and tracking how they evolve. Short texts on social networks are massive in volume and high-dimensional, which makes them hard for topic models to process, and their uneven topic distribution leaves the detected topics unclear. To address these problems, we propose CBOW-LDA, a topic modeling method based on the LDA (Latent Dirichlet Allocation) topic model. It applies word vectorization based on the CBOW (Continuous Bag-of-Words) model to cluster similar words in the target corpus, which reduces the dimensionality of the text fed into the LDA model and makes the topics clearer. Analysis on a real-world dataset shows that, compared with an existing LDA method that vectorizes words by word-frequency weights, CBOW-LDA lowers perplexity by about 3% for the same number of topic words.
Key words:    word vectors    LDA model    topic detection    perplexity
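The dimensionality-reduction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering procedure, so the greedy cosine-similarity clustering, the `threshold` value, and the two-dimensional `TOY_VECTORS` (which merely stand in for trained CBOW vectors) are all assumptions made for illustration. The sketch replaces each word with the representative of its cluster, so the vocabulary passed on to a topic model shrinks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.9):
    """Greedy single-pass clustering (an assumed procedure, not the paper's).

    Each word joins the first existing representative whose vector is at
    least `threshold` similar; otherwise it starts a new cluster.
    Returns a dict mapping every word to its cluster representative.
    """
    representatives = []
    mapping = {}
    for word, vec in vectors.items():
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                mapping[word] = rep
                break
        else:
            representatives.append(word)
            mapping[word] = word
    return mapping


def reduce_corpus(docs, mapping):
    """Replace each token by its cluster representative."""
    return [[mapping.get(tok, tok) for tok in doc] for doc in docs]


# Hypothetical toy vectors standing in for CBOW output.
TOY_VECTORS = {
    "phone": (0.90, 0.10),
    "mobile": (0.88, 0.12),
    "handset": (0.85, 0.15),
    "soccer": (0.10, 0.90),
    "football": (0.12, 0.88),
}

docs = [["phone", "soccer"], ["mobile", "football"], ["handset", "soccer"]]

mapping = cluster_words(TOY_VECTORS)
reduced = reduce_corpus(docs, mapping)

vocab_before = {tok for doc in docs for tok in doc}
vocab_after = {tok for doc in reduced for tok in doc}
print(len(vocab_before), "->", len(vocab_after))
```

In a real pipeline the reduced documents, rather than the raw ones, would be converted to a bag-of-words corpus and passed to an LDA implementation, and perplexity would be compared between the two inputs.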
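Received: 2016-03-19     Revised:
DOI:
Funding: Supported by the National Natural Science Foundation of China (61402373, 61303224, 61403311) and the Aeronautical Science Foundation of China (20155553036, 2013ZC53034)
Corresponding author:     Email:
About the author: Guo Lantian (born 1987) is a PhD candidate at Northwestern Polytechnical University whose research focuses on data mining and machine learning.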