A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification

Saeedeh Davoudi, Sayeh Mirzaei

Abstract

Natural Language Processing (NLP) is one of the promising fields of artificial intelligence. Recently, a high volume of text data has been generated through the Internet. This kind of data is a valuable source of information that can be used in various fields such as information retrieval and recommender systems. One practical task of text mining is document classification. In this paper, we focus on Persian document classification. We introduce a new feature extraction approach that combines K-means clustering with Word2Vec to obtain semantically relevant and discriminative word representations. We call the proposed approach CC-Word2Vec (Categorical Clustering-Word2Vec) and use several classification models to compare its performance with that of other techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Latent Dirichlet Allocation (LDA). The proposed method improved the accuracy of all classifiers compared with the other techniques.

Keywords

Persian document classification, TF-IDF, Word2Vec, CC-Word2Vec, MLP, GB, LDA, K-Means

The CSI Journal on Computer Science and Engineering
