An enhanced sequential exception technique for semantic-based text anomaly detection

Detecting semantic-based text anomalies is a research area that has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data are characterized by having problems...

Bibliographic Details
Main Author: Taiye, Mohammed Ahmed
Format: Thesis
Language: eng
Published: 2019
Subjects:
Online Access: https://etd.uum.edu.my/8112/1/s900757_01.pdf
https://etd.uum.edu.my/8112/2/s900757_02.pdf
id my-uum-etd.8112
record_format uketd_dc
institution Universiti Utara Malaysia
collection UUM ETD
language eng
advisor Kamaruddin, Siti Sakira
Kabir Ahmad, Farzana
topic T58.5-58.64 Information technology
description Detecting semantic-based text anomalies is a research area that has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data pose problems of ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, semantic-based text anomalies will be identified less accurately. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomalies by achieving five objectives: (1) to modify the Sequential Exception Technique (SET) to process unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize the modified SET with Latent Semantic Analysis (LSA); (4) to integrate the Lesk and Selectional Preference algorithms for disambiguating senses and identifying the canonical form of text; and (5) to represent semantic-based text anomalies using First Order Logic (FOL) and Concept Network Graph (CNG). ESET performs text anomaly detection by employing the optimized Cosine Similarity, hybridizing LSA with the modified SET, and integrating these with Word Sense Disambiguation algorithms, specifically Lesk and Selectional Preference. FOL and CNG are then used to represent the detected anomalies. To demonstrate the feasibility of the technique, experiments were run on four datasets: NIPS, ENRON, Daily Kos blog, and 20Newsgroups. The experimental evaluation showed that ESET significantly improved the accuracy of detecting semantic-based text anomalies in documents. Compared with the benchmarked methods, ESET achieved improved F1-scores on all four datasets: NIPS 0.75, ENRON 0.82, Daily Kos blog 0.93, and 20Newsgroups 0.97.
The results generated by ESET proved significant and support the growing notion of semantic-based text anomaly that is increasingly evident in the existing literature. Practically, this study contributes to topic modelling and concept coherence for the purposes of information visualization, knowledge sharing and optimized decision making.
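The abstract names Cosine Similarity as the measure used to separate similar from dissimilar text. The sketch below illustrates only that general idea and is not the thesis's ESET method: it assumes plain whitespace tokenization and raw term counts, and the helper names (cosine_similarity, least_similar) and the toy documents are hypothetical. It flags the document whose average similarity to the others is lowest.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def least_similar(docs):
    """Index of the document with the lowest mean similarity to the rest."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to compare")
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    vectors = [Counter(d.lower().split()) for d in docs]

    def mean_similarity(i):
        sims = [cosine_similarity(vectors[i], v)
                for j, v in enumerate(vectors) if j != i]
        return sum(sims) / len(sims)

    return min(range(len(docs)), key=mean_similarity)


# Hypothetical toy corpus: three related sentences and one unrelated one.
docs = [
    "the market closed higher today",
    "the market closed lower today",
    "the market opened higher today",
    "penguins huddle on antarctic ice",
]
outlier_index = least_similar(docs)
print(docs[outlier_index])
```

A raw bag-of-words comparison like this ignores word senses and synonyms, which is the kind of limitation the abstract's LSA and Word Sense Disambiguation steps are meant to address.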
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Taiye, Mohammed Ahmed
title An enhanced sequential exception technique for semantic-based text anomaly detection
granting_institution Universiti Utara Malaysia
granting_department Awang Had Salleh Graduate School of Arts & Sciences
publishDate 2019
url https://etd.uum.edu.my/8112/1/s900757_01.pdf
https://etd.uum.edu.my/8112/2/s900757_02.pdf
spelling my-uum-etd.81122022-05-09T08:13:52Z An enhanced sequential exception technique for semantic-based text anomaly detection 2019 Taiye, Mohammed Ahmed Kamaruddin, Siti Sakira Kabir Ahmad, Farzana Awang Had Salleh Graduate School of Arts & Sciences Awang Had Salleh Graduate School of Arts & Sciences T58.5-58.64 Information technology QA273-280 Probabilities. Mathematical statistics The detection of semantic-based text anomaly is an interesting research area which has gained considerable attention from the data mining community. Text anomaly detection identifies deviating information from general information contained in documents. Text data are characterized by having problems related to ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, identifying semantic-based text anomaly will be less accurate. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomaly by achieving five objectives: (1) to modify Sequential Exception Technique (SET) in processing unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize modified SET with Latent Semantic Analysis (LSA); (4) to integrate Lesk and Selectional Preference algorithms for disambiguating senses and identifying text canonical form; and (5) to represent semantic-based text anomaly using First Order Logic (FOL) and Concept Network Graph (CNG). ESET performs text anomaly detection by employing optimized Cosine Similarity, hybridizing LSA with modified SET, and integrating it with Word Sense Disambiguation algorithms specifically Lesk and Selectional Preference. Then, FOL and CNG are proposed to represent the detected semantic-based text anomaly. To demonstrate the feasibility of the technique, four selected datasets namely NIPS data, ENRON, Daily Koss blog, and 20Newsgroups were experimented on. 
The experimental evaluation revealed that ESET has significantly improved the accuracy of detecting semantic-based text anomaly from documents. When compared with existing measures, the experimental results outperformed benchmarked methods with an improved F1-score from all datasets respectively; NIPS data 0.75, ENRON 0.82, Daily Koss blog 0.93 and 20Newsgroups 0.97. The results generated from ESET has proven to be significant and supported a growing notion of semantic-based text anomaly which is increasingly evident in existing literatures. Practically, this study contributes to topic modelling and concept coherence for the purpose of visualizing information, knowledge sharing and optimized decision making. 2019 Thesis https://etd.uum.edu.my/8112/ https://etd.uum.edu.my/8112/1/s900757_01.pdf text eng public https://etd.uum.edu.my/8112/2/s900757_02.pdf text eng public phd doctoral Universiti Utara Malaysia A.Rajaraman, J. Leskovec, J. D. U. (2016). Mining Massive Data Sets Winter 2016. Cambridge University Press. Retrieved from http://web.stanford.edu/class/cs246 ABDULSAHIB, A. K. (2015). Graph based text representation for document clustering asma khazaal abdulsahib. Abdulsahib, A. K., & Kamaruddin, S. S. (2015). Graph based text representation for document clustering. Journal of Theoretical and Applied Information Technology, 76(1), 1–13. Retrieved from http://www.scopus.com/inward/record.url?eid=2-s2.0- 84930694414&partnerID=40&md5=5c7f0059c26594915cdf9360315173c7 Abouzakhar, N., Allison, B., & Guthrie, L. (2008). Unsupervised Learning-based Anomalous Arabic Text Detection. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC’08), 291–296. Retrieved from http://www.lrec-conf.org/proceedings/lrec2008/summaries/83.html Acree, B., Jansa, J., & Shoub, K. (2016). Comparing and Evaluating Cosine Similarity Scores, Weighted Cosine Similarity Scores, and Substring Matching. 
Retrieved from https://shoub.web.unc.edu/files/2016/04/AHJS_Weighted_Cosine.pdf Adler-Golden, S. M. (2009). Improved hyperspectral anomaly detection in heavy-tailed backgrounds. WHISPERS ’09 - 1st Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2–5. https://doi.org/10.1109/WHISPERS.2009.5289019 Aggarwal, C., & Zhai, C. (2012). Mining text data. (C. C. C. Z. AGGARWAL, Ed.), Mining Text Data (Vol. 4). Kluwer Academic Publishers Boston/Dordrecht/London. https://doi.org/10.1007/978-1-4614-3223-4 Agirre, E., & Martinez, D. (2002). Integrating selectional preferences in WordNet. Proceedings of the First International WordNet Conference, 9. Retrieved from http://arxiv.org/abs/cs/0204027 Akarsu, B., Bayram, K., Slisko, J., & Corona Cruz, A. (2013). International Journal Of Scientific Research And Education. Ijsae.In, 6(3), 221–232. Retrieved from http://ijsae.in/ijsaeems/index.php/ijsae/article/viewFile/157/137 Akoglu, L., Tong, H., & Koutra, D. (2014). Graph-based Anomaly Detection and Description: A Survey. ArXiv Preprint ArXiv:1404.4679, 49. https://doi.org/10.1007/s10618-014-0365-y Alagi, D. (2009). Experiments on Active Learning for Croatian Word Sense Disambiguation. Allan Collins, J. S. B., Larkin, & K. M., & Newman, B. B. and. (2007). INFERENCE IN TEXT UNDERSTANDING. University of Illinois at Urbana- Champaign 51 Gerty Drive Champaign, Illinois 61820. Allan, J., Carbonell, J., & Doddington, G. (1998). Topic detection and tracking pilot study: Final report. DARPA Broadcast News Transcription and Understanding Workshop., 194–218. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.6373&rep=rep1&type=pdf Almarimi, A., & Andrejková, G. (2016). Text Anomalies Detection Using Histograms of Words. ACSIJ Advances in Computer Science: An International Journal, 5(1), 63– 68. Arning, A., & Rakesh, A. (1996). Method for Deviation in Large Databases. KDD-96 Proceedings. Atefeh, F., & Khreich, W. (2015). 
A Survey of Techniques for Event Detection in Twitter TECHNIQUES FOR EVENT DETECTION IN TWITTER. Computational Intelligence, 0(1), 132–164. https://doi.org/10.1111/coin.12017 Balbi, S. (2010). Beyond the curse of multidimentionality: high dimentional clustering in context mining. Statistica Applicata - Italian Journal of Applied Statistics, 22(1), 53–63. Banerjee, S. (2002). Adapting the Lesk Algorithm for Word Sense Disambiguation to WordNet, (December). Basile, P., Caputo, A., & Semeraro, G. (2014). An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING 14), 1591–1600. Belford, M., Mac Namee, B., & Greene, D. (2018). Stability of topic modeling via matrix factorization. Expert Systems with Applications, 91, 159–169. https://doi.org/10.1016/j.eswa.2017.08.047 Beltagy, I., Roller, S., Cheng, P., Erk, K., & Mooney, R. J. (2015). Representing Meaning with a Combination of Logical Form and Vectors, 1–44. Retrieved from http://arxiv.org/abs/1505.06816 Berant, J., Chou, A., Frostig, R., & Liang, P. (2013). Semantic Parsing on Freebase from Question-Answer Pairs. Proceedings of EMNLP, (October), 1533–1544. Retrieved from https://www.aclweb.org/anthology/D/D13/D13- 1160.pdf%5Cnhttp://www.samstyle.tk/index.pl/00/http/nlp.stanford.edu/pubs/semp arseEMNLP13.pdf Bernotas, M., Karklius, K., Laurutis, R., & Slotkiene, A. (2007). The peculiarities of the text document representation, using ontology and tagging-based clustering technique. Information Technology and Control, 36(2), 217–220. Bertoldi, N., Cettolo, M., & Federico, M. (2010). Statistical Machine Translation of Texts with Misspelled Words. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, (June), 412–419. Bhaduri, K., Matthews, B. L., & Giannella, C. R. (2011). Algorithms for speeding up distance-based outlier detection. 
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 859–867. https://doi.org/10.1145/2020408.2020554 Blei, D. M., Ng, A. Y., & Jordan, M. I. (2012). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4–5), 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993 Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A Topic Model for Word Sense Disambiguation. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), 1024–1033. Brants, T., Chen, F., & Farahat, A. (2003). A system for new event detection. ACM SIGIR Conference on Research and Development in Informaion Retrieva, (pp. 330- 337). Brants, T., Chen, F., & Tsochantaridis, I. (2002). Topic-based document segmentation with probabilistic latent semantic analysis. Proceedings of the Eleventh International Conference on Information and Knowledge Management CIKM 02, 211. https://doi.org/10.1145/584792.584829 Breja, M. (2015). A Novel approach for Novelty Detection of Web Documents, 6(5), 4257–4262. Brody, S. (2005). Cluster-Based Pattern Recognition in Natural Language Text. English, (August). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.7288&rep=rep1 &type=pdf Bruynooghe, M., & Denecker, M. (2014). First Order Logic with Inductive Definitions for Model-Based Problem Solving. Bustince, H., Fernadez, J., & Mesiar, R. (2011). Restricted dissimilarity functions and penalty functions. Eusflat-Lfa 2011, (July). Retrieved from http://library.utia.cas.cz/separaty/2012/E/mesiar-restricted dissimilarity functions and penalty functions.pdf Cai, D., He, X., Wu, X., & Han, J. (2008). Non-negative matrix factorization on manifold. Proceedings - IEEE International Conference on Data Mining, ICDM, 63–72. https://doi.org/10.1109/ICDM.2008.57 Cambria, E., & Melfi, G. (2015). 
Semantic Outlier Detection for Affective Common- Sense Reasoning and Concept-Level Sentiment Analysis, 276–281. Cammert, M., Heinz, C., Kramer, J., & Riemenschneider, T. (n.d.). Systems and/or methods for event stream deviation detection. U.S. Patent No. 9,659,063. Washington, DC: U.S. Patent and Trademark Office. Retrieved from https://www.google.com/patents/US9659063 Capurro, I., Lecumberry, F., Martín, Á., Ramírez, I., Rovira, E., & Seroussi, G. (2016). Efficient sequential compression of multi-channel biomedical signals. IEEE Journal of Biomedical and Health Informatics, PP(NN), 13. Retrieved from http://arxiv.org/abs/1605.04418 Cha, S. (2007). Comprehensive Survey on Distance / Similarity Measures between Probability Density Functions, 1(4). Chandarana, D. R. (2015). A Survey for Different Approaches of Outlier Detection in Data Mining, 1–4. Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(September), 1–58. https://doi.org/10.1145/1541880.1541882 Chaplot, D. S., & Salakhutdinov, R. (2018). Knowledge-based Word Sense Disambiguation using Topic Models. Retrieved from http://arxiv.org/abs/1801.01900 Chen, X., & Wu, C. (2012). A Text Representation Method Based on Harmonic Series. In IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 (pp. 1830–1834). Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, I. (2008). Text classification and Naive Bayes. Retrieved from lp.stanford.edu/IR- book/html/htmledition/text-classification-and-naive-bayes-1.html Cichosz, P. (2018). Anomaly detection in discussion forum posts using global vectors. In SPIE. Proc. SPIE 10808. https://doi.org/10.1117/12.2501345 Classen, A., Boucher, Q., & Heymans, P. (2011). A text-based approach to feature modelling: Syntax and semantics of TVL. Science of Computer Programming, 76(12), 1130–1143. https://doi.org/10.1016/j.scico.2010.10.005 Dang, S., & Ahmad, P. H. 
(2014). Text Mining : Techniques and its Application, 1(4), 22–25. Debortoli, S., Müller, O., Junglas, I. A., & vom Brocke, J. (2016). Text Mining for Information Systems Researchers: An Annotated Tutorial. Manuscript Submitted for Publication, (April). Deshpande, R., Vaze, K., Rathod, S., & Jarhad, T. (2014). Comparative Study of Document Similarity Algorithms and Clustering Algorithms for Sentiment Analysis. Ijettcs.Org, 3(5), 196–199. Retrieved from http://www.ijettcs.org/Volume3Issue5/IJETTCS-2014-10-21-85.pdf Ding, R., Nallapati, R., Xiang, B., & Services, A. W. (2016). Coherence-Aware Neural Topic Modeling, 1. Drissi, M., & Watkins, O. (2017). Hierarchical Text Generation using an Outline. Eshghi, A., Howes, C., Gregoromichelaki, E., Hough, J., & Purver, M. (2015). Feedback in Conversation as Incremental Semantic Update. Iwcs 2015. Retrieved from http://www.aclweb.org/website/old_anthology/W/W15/W15-01.pdf#page=123 Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. https://doi.org/10.18653/v1/W16- 2506 Foltz, P. W. (1996). Latent Semantic Analysis for Text-Based. Behavior Research Methods, Instruments and Computers, 28(2), 197–202. https://doi.org/10.3758/BF03204765 Franzoni, V. (2017). Just an Update on PMING Distance for Web-based Semantic Similarity in Artificial Intelligence and Data Mining, 1–3. https://doi.org/10.13140/RG.2.2.20531.22560 Froud, H., Lachkar, A., & Ouatik, S. (2013). Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. ArXiv Preprint ArXiv:1302.1612. Retrieved from http://arxiv.org/abs/1302.1612 Furtado, P., Nadal, S., Peralta, V., Djedaini, M., & Marcel, P. (2015). Materializing Baseline Views for Deviation Detection Exploratory OLAP, 1–12. Fyshe, A., Talukdar, P., Murphy, B., & Mitchell, T. (2013). Documents and Dependencies : an Exploration of Vector Space Models for Semantic Composition. Conll, 84–93. 
Gabrilovich, Evgeniy, and S. M. (2005). Feature generation for text categorization using world knowledge. IJCAI International Joint Conference on Artificial Intelligence, 5(pp. 1048-1053.). Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJCAI International Joint Conference on Artificial Intelligence, 1606–1611. https://doi.org/10.1145/2063576.2063865 Gahl, S., Menn, L., Ramsberger, G., Jurafsky, D. S., Elder, E., Rewega, M., & Audrey, L. H. (2003). Syntactic frame and verb bias in aphasia: Plausibility judgments of undergoer-subject sentences. Brain and Cognition, 53(2), 223–228. https://doi.org/10.1016/S0278-2626(03)00114-3 Garrette, D., Erk, K., & Mooney, R. (2014). A Formal Approach to Linking Logical Form and Vector-Space Lexical Semantics. Computing Meaning SE - 3, 47, 27–48. https://doi.org/10.1007/978-94-007-7284-7_3 Gelbukh, A., Sidorov, G., & Han, S.-Y. (2005). On some optimization heuristics for lesk-like WSD algorithms. Nldb’05, 402–405. Giannoulis, P., Potamianos, G., & Maragos, P. (2018). On the Joint Use of NMF and Classification for Overlapping Acoustic Event Detection. Proceedings, 2(2), 90. https://doi.org/10.3390/proceedings2020090 Gilad Katz, Yuval Elovici, & B. S. (2014). SEMANTIC BASED CONTEXTUAL CLUSTERING FOR DATA LEAKAGE PREVENTION THROUGH ANOMALY DETECTION. Gloor, P. A., Niepel, S., L, Y., Whalley, G., Skilling, J. K., Kitchen, L., & Causey, R. (2006). Identifying Potential Suspects by Temporal Link Analysis Discovering Suspicious Activity in the Enron e-Mail Dataset Filtering by Keywords, 9. Godbole, S. (2002). Exploiting confusion matrices for automatic generation of topic hierarchies and scaling up multi-way classifiers. Progress Report, IIT Bombay, (March 2002), 17. Retrieved from http://www.it.iitb.ac.in/~shantanu/work/report.pdf Goldstein, M., Goldstein, M., & Uchida, S. (2016). 
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE, (April), 1–31. https://doi.org/10.7910/DVN/OPQMVF Gomaa, W. H. (2013). A Survey of Text Similarity Approaches. International Journal of Computer Applications, 68(13), 13–18. Gong, Y., Zhao, K., & Zhu, K. Q. (2016). Representing Verbs as Argument Concepts. Proceedings of the 30th Conference on Artificial Intelligence (AAAI 2016), 2615– 2621. Goodfellow, I. (2016). NIPS 2016 Tutorial: Generative Adversarial Networks. https://doi.org/10.1001/jamainternmed.2016.8245 Guthrie, D. (2008). Unsupervised Detection of Anomalous Text. Distribution, (July). Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. IJCAI International Joint Conference on Artificial Intelligence, 1624– 1628. H, S. D., M, M. K., & Science, C. (2015). International Journal of Combined Research & Development ( IJCRD ) eISSN : 2321-225X ; pISSN : 2321-2241 Volume : 4 ; Issue : 2 ; February -2015 A Survey on Text Mining Approaches International Journal of Combined Research & Development ( IJCRD ), 251–256. Han, J. (2014). Data Mining : Concepts and Techniques. Hardin, J. S., Sarkis, G., & Urc, P. C. (2015). Network analysis with the enron email corpus. Journal of Statistics Education, 23(2). https://doi.org/10.1080/10691898.2015.11889734 Hassan, S., & Mihalcea, R. (2011). Semantic Relatedness Using Salient Semantic Analysis. Proceedings of the 25th AAAI Conference on Artificial Intelligence, (AAAI 2011), 884–889. Retrieved from http://www.samerhassan.com/images/4/48/Hassan.pdf%5Cnhttp://www.aaai.org/oc s/index.php/AAAI/AAAI11/paper/download/3616/3972 Héas, P., Drémeau, A., & Herzet, C. (2016). An Efficient Algorithm for Video Superresolution Based on a Sequential Model. SIAM Journal on Imaging Sciences, 9(2), 537–572. https://doi.org/10.1137/15M1023956 Henriksson, A., Moen, H., Skeppstedt, M., Daudaravičius, V., & Duneld, M. (2014). 
Synonym extraction and abbreviation expansion with ensembles of semantic spaces. Journal of Biomedical Semantics, 5(1), 6. https://doi.org/10.1186/2041- 1480-5-6 Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266. https://doi.org/10.1126/science.aaa8685 Hodge, V. J., & Austin, J. (2004). A Survey of Outlier Detection Methodoligies. Artificial Intelligence Review, 22(1969), 85–126. https://doi.org/10.1007/s10462- 004-4304-y Huang, A. (2008). Similarity measures for text document clustering. Proceedings of the Sixth New Zealand, (April), 49–56. Retrieved from http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Simila rity_Measures_for_Text_Document_Clustering.pdf Issa, H., & Vasarhelyi, M. A. (2011). Application of Anomaly Detection Techniques to Identify Fraudulent Refunds. SSRN Working Papers Series, 1–19. https://doi.org/10.2139/ssrn.1910468 Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666. https://doi.org/10.1016/j.patrec.2009.09.011 Janz, A., Kȩdzia, P., & Piasecki, M. (2018). Graph-based complex representation in inter-sentence relation recognition in Polish texts. Cybernetics and Information Technologies, 18(1), 152–170. https://doi.org/10.2478/cait-2018-0013 Jiang, L., Zhang, H., Yang, X., & Xie, N. (2013). Research on Semantic Text Mining Based on Domain Ontology, 336–343. Joachims, T. (1998). Text Categorization with Suport Vector Machines: Learning with Many Relevant Features. Proceedings of the 10th European Conference on Machine Learning, 137–142. https://doi.org/10.1007/BFb0026683 Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Speech and Language Processing An Introduction to Natural Language Processing Computational Linguistics and Speech Recognition, 21, 0– 934. 
https://doi.org/10.1162/089120100750105975 Kamaruddin, S. S. B. (2011). FRAMEWORK FOR DEVIATION DETECTION IN TEXT. Kamaruddin, S. S., Bakar, A. A., Hamdan, A. R., Nor, F. M., Nazri, M. Z. A., Othman, Z. A., & Hussein, G. S. (2015). A text mining system for deviation detection in financial documents. Intelligent Data Analysis, 19(s1), S19–S44. https://doi.org/10.3233/IDA-150768 Kamaruddin, S. S., Hamdan, A. R., & Bakar, A. A. (2007). Text Mining for Deviation Detection in Financial Statement, 446–449. Kamaruddin, S. S., Hamdan, A. R., Bakar, A. A., & Mat Nor, F. (2012). Deviation detection in text using conceptual graph interchange format and error tolerance dissimilarity function. Intelligent Data Analysis, 16(3), 487–511. https://doi.org/10.3233/IDA-2012-0535 Kamruzzaman, S. M., Haider, F., & Hasan, A. R. (2010). Text Classification using Data Mining. Science, 19. Retrieved from http://arxiv.org/abs/1009.4987 Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017). Outlier Detection for Text Data : An Extended Version. ArXiv, 489–497. Kannan, Ramakrishnan, Woo, H., Aggarwal, C. C., & Park, H. (2017). Outlier Detection for Text Data : An Extended Version. Retrieved from http://arxiv.org/abs/1701.01325 Karkali, M., Rousseau, F., Ntoulas, A., & Vazirgiannis, M. (2014). Using temporal IDF for efficient novelty detection in text streams. ArXiv, 30. Retrieved from http://arxiv.org/abs/1401.1456 Katariya, N. P., & Chaudhari, M. S. (2015). 126. Text Preprocessing for Text Mining Using Side Information. International Journal of Computer Science and Mobile Applications, 3, 3–7. Kim, J., & Montague, P. (2017). An Efficient Semi-Supervised SVM for Anomaly Detection, 2843–2850. Kobus, C., Yvon, F., & Damnati, G. (2008). Normalizing SMS: are two metaphors better than one? Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, (August), 441–448. Retrieved from http://dl.acm.org/citation.cfm?id=1599137 Koehrsen, W. (2017). 
Machine Learning with Python on the Enron Dataset. Retrieved November 23, 2018, from https://medium.com/@williamkoehrsen/machine- learning-with-python-on-the-enron-dataset-8d71015be26d Kshirsagar, M., Thomson, S., Schneider, N., Carbonell, J., Smith, N. a, & Dyer, C. (2015). Frame-Semantic Role Labeling with Heterogeneous Annotations. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 218–224. Kumar, a A. (2012). Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Sri Sivani College of Engineering Sri Sivani College of Engineering, 1(5), 1–6. Kumar Palaniswamy Supervisor, H., & Aldous, D. (2015). Exploratory Data Analysis of Enron Emails. Kumaraswamy, R., & Shavlik, J. (2012). Anomaly Detection in Text : The Value of Domain Knowledge, 225–228. Lee, Hanjun, Keunho Choi, Donghee Yoo, Yongmoo Suh, Soowon Lee, G. H. (2017). Recommending valuable ideas in an open innovation community A text mining approach to information overload problem. https://doi.org/10.1108/eb057530 Lenci, A., Montemagni, S., & Pirrelli, V. (2001). The Acquisition and Representation of Word Meaning The Acquisition and Representation of Word Meaning . An Overview. Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries. Proceedings of the 5th Annual International Conference on Systems Documentation- SIGDOC ’86, 24–26. https://doi.org/10.1145/318723.318728 Leveling, J. (2007). IRSAW – Towards Semantic Annotation of Documents for Question Answering. Leyzerov, O. (2017). Identifing Fraud from Enron Email and financial data. Retrieved November 23, 2018, from https://olegleyz.github.io/enron_classifier.html Li, L., Hu, X., Hu, B. Y., Wang, J., & Zhou, Y. M. (2009). Measuring sentence similarity from different aspects. 
Proceedings of the 2009 International Conference on Machine Learning and Cybernetics, 4(July), 2244–2249. https://doi.org/10.1109/ICMLC.2009.5212182 Li, L. I. N., Hu, X. I. A., Hu, B., Wang, J. U. N., & Zhou, Y. (2009). MEASURING SENTENCE SIMILARITY FROM DIFFERENT ASPECTS, (July), 12–15. Li, X., Member, D. F., Croft, W. B., Head, D., & University, B. E. T. (2006). Sentence Level Information Patterns for Novelty Detection, 1–10. https://doi.org/10.1145/1183614.1183652 Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. ICICS 2009 - Conference Proceedings of the 7th International Conference on Information, Communications and Signal Processing. https://doi.org/10.1109/ICICS.2009.5397541 Lin, Y.-S., Jiang, J.-Y., & Lee, S.-J. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575–1590. https://doi.org/10.1109/TKDE.2013.19 Liu, H., Ke, W., Wei, K. K., & Hua, Z. (2013). The impact of IT capabilities on firm performance: The mediating roles of absorptive capacity and supply chain agility. Decision Support Systems, 54(3), 1452–1462. https://doi.org/10.1016/j.dss.2012.12.016 Liu, Z. (2013). High Performance Latent Dirichlet Allocation for Text Mining. M. J. Denny & A. Spirling. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Mahapatra, A., Srivastava, N., & Srivastava, J. (2012). Contextual anomaly detection in text data. Algorithms, 5(4), 469–489. https://doi.org/10.3390/a5040469 Maitra, Anutosh (Bangalore, I., Mohamedrasheed, Annervaz Karukapadath (Trichur, I., Jain, Tom Geo (Bangalore, I., Shivaram, Madhura (Bangalore, I., Sengupta, Shubhashis (Bangalore, I., Ramnani, Roshni Ramesh (Bangalore, I., … Sahu, Vedamati (Bangalore, I. (2016). SYSTEM FOR AUTOMATED ANALYSIS OF CLINICAL TEXT FOR PHARMACOVIGILANCE. Retrieved June 17, 2016, from http://www.freepatentsonline.com/y2016/0048655.html Manevitz, L. M. 
(2001). One-Class SVMs for Document Classification. Journal of Machine Learning Research, 2, 139–154. https://doi.org/10.1162/15324430260185574 Margaret Rouse. (2005). First order predicate Logic. Retrieved October 3, 2015, from http://whatis.techtarget.com/definition/first-order-logic Marvin, R. (2018). Exploring Word Sense Disambiguation Abilities of Neural Machine Translation Systems, 1, 125–131. McInnes, B. T., & Pedersen, T. (2013). Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. Journal of Biomedical Informatics, 46(6), 1116–1124. https://doi.org/10.1016/j.jbi.2013.08.008 Meystre, S. M., Savova, G. K., Kipper-Schuler, K. C., & Hurdle, J. F. (2008). Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical Informatics Methods Inf Med, 47(1), 128–144. https://doi.org/me08010128 Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. Proceedings of the 21st National Conference on Artificial Intelligence, 1, 775–780. https://doi.org/10.1.1.65.3690 Miller, R. C., & Myers, B. A. (2001). Outlier finding. Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology - UIST ’01, 81. https://doi.org/10.1145/502348.502361 Montes-y-gómez, M., Gelbukh, A. F., & López-lópez, A. (2002a). Detecting Deviations in Text Collections: An Approach Using Conceptual Graphs. Mexican International Conference on Artificial Intelligence, 176–184. https://doi.org/10.1007/3-540-46016-0_19 Montes-y-gómez, M., Gelbukh, A., & López-lópez, A. (2002b). Text Mining at Detail Level Using Conceptual Graphs, 122–136. Nakov, P. (2013). On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19(03), 291–330. https://doi.org/10.1017/S1351324913000065 Navigli, R. (2009a). Word sense disambiguation: A survey. 
ACM Computing Surveys (CSUR), 41(3), 10. https://doi.org/10.1145/1459352.1459355 Navigli, R. (2009b). Word sense disambiguation. ACM Computing Surveys, 41(2), 1–69. https://doi.org/10.1145/1459352.1459355 Ngai, E. W. T., Hong, T., Polytechnic, K., Hom, H., Kong, H., Hom, H., & Kong, H. (2016). a Review of the Literature on Applications of Text Mining in Policy Making. Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756–3763. https://doi.org/10.1016/j.eswa.2012.12.082 Otterbacher, J., & Radev, D. (2006). Fact-focused novelty detection: A feasibility study. Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, 687–688. https://doi.org/10.1145/1148170.1148318 Pappas, Y. (2018). Fraud Detection Using Machine Learning (Analysis). Retrieved November 23, 2018, from http://www.yannispappas.com/Fraud-Detection-Using- Machine-Learning/ Parr, T. (2012). jguru. Retrieved January 1, 2015, from http://www.jguru.com/faq/view.jsp?EID=81 Patel, F. N., & Soni, N. R. (2012). Text mining: A Brief survey. International Journal of Advanced Computer Research, 2(6), 243–248. Retrieved from http://www.theaccents.org/ijacr/papers/conference/icett2012/43.pdf Pawar, A. M. (2015). A Comprehensive Survey on Online Anomaly Detection, 119(17), 41–45. Peter Norvig. (2015). Natural Language Processing What We Do. Retrieved December 9, 2015, from http://research.google.com/pubs/NaturalLanguageProcessing.html Poon, H., & Domingos, P. (2010). Unsupervised ontology induction from text. Proceedings of the 48th Annual Meeting of the …, (July), 296–305. Retrieved from http://dl.acm.org/citation.cfm?id=1858712 Powers, D. M. W. (2015). What the F-measure doesn’t measure: Features, Flaws, Fallacies and Fixes. https://doi.org/KIT-14-001 Pradhan, N., Gyanchandani, M., & Wadhvani, R. 
(2015). A Review on Text Similarity Technique used in IR and its Application. International Journal of Computer Applications, 120(9), 29–34. https://doi.org/10.5120/21257-4109 Provost, F., Fawcett, T., & Kohavi, R. (1997). The Case Against Accuracy Estimation for Comparing Induction Algorithms. Proceedings of the Fifteenth International Conference on Machine Learning, 445–453. Ramage, D., Heymann, P., Manning, C. D., & Garcia-Molina, H. (2009). Clustering the tagged web. Proceedings of the Second ACM International Conference on Web Search and Data Mining - WSDM ’09, 54. https://doi.org/10.1145/1498759.1498809 Ramya, R. S., Venugopal, K. R., Iyengar, S. S., & Patnaik, L. M. (2016). Feature Extraction and Duplicate Detection for, 16(5). Ray, S., & Craven, M. (2001). Representing sentence structure in hidden Markov models for information extraction. International Joint Conference On, 17(1), 1273–1279. Retrieved from http://scholar.google.com/scholar?q=intitle:Representing+Sentence+Structure+in+Hidden+Markov+Models+for+Information+Extraction#0 Ren, F., & Sohrab, M. G. (2013). Class-indexing-based term weighting for automatic text classification. Information Sciences, 236, 109–125. https://doi.org/10.1016/j.ins.2013.02.029 Rennie, J. (2008). 20 Newsgroups. Retrieved November 2, 2018, from http://qwone.com/~jason/20Newsgroups/ Rosario, B., & Hearst, M. A. (2004). Classifying semantic relations in bioscience texts. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 430. https://doi.org/10.3115/1218955.1219010 Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language (EMNLP-CoNLL’07), 1(June), 410–420. https://doi.org/10.7916/D80V8N84 Rumshisky, A. (2008). Resolving Polysemy in Verbs: Contextualized Distributional Approach to Argument Semantics. 
Distributional Models of the Lexicon in Linguistics and Cognitive Science, Special Issue of Italian Journal of Linguistics, 1–27. Sardar, R. P. ; S. S. ; S. K. N. ; M. M. (2018). Improving Lesk by Incorporating Priority for Word Sense Disambiguation. https://doi.org/10.1109/EAIT.2018.8470436 Sayeed, A., Greenberg, C., & Demberg, V. (2016). Thematic fit evaluation: an aspect of selectional preferences. ACL 2016, 99. Silveira, S. B., & Branco, A. (2012). Combining a double clustering approach with sentence simplification to produce highly informative multi-document summaries. Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration, IRI 2012, (1), 482–489. https://doi.org/10.1109/IRI.2012.6303047 Slimani, T. (2013). Description and Evaluation of Semantic Similarity Measures Approaches. International Journal of Computer Applications, 80(10), 25–33. https://doi.org/10.5120/13897-1851 Steinberger, J., & Ježek, K. (2004). Using Latent Semantic Analysis in Text Summarization. In Proceedings of ISIM 2004, 93–100. Sugiyama, M., & Borgwardt, K. (2013). Rapid Distance-Based Outlier Detection via Sampling. Advances in Neural Information Processing Systems 26 (Proceedings of NIPS), 1–9. Sun, F., Guo, J., Lan, Y., Xu, J., & Cheng, X. (2016). Semantic Regularities in Document Representations. Retrieved from http://arxiv.org/abs/1603.07603 Szmeja, P., Ganzha, M., Paprzycki, M., & Pawłowski, W. (2018). Dimensions of Semantic Similarity, 87–125. Takahashi, T. (2011). Discovering Emerging Topics in Social Streams via Link Anomaly Detection, 26, 1–18. https://doi.org/10.1109/icdm.2011.53 Tan, L., Zhang, H., Clarke, C. L. A., & Smucker, M. D. (2015). Lexical Comparison Between Wikipedia and Twitter Corpora by Using Word Embeddings. ACL, 657–661. Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Introduction to Data Mining, 769. Torres, S., & Gelbukh, A. (2009). 
Comparing Similarity Measures for Original WSD Lesk Algorithm. Advances in Computer Science and Applications, 43, 155–166. Tsai, F. S. (2007). Novelty detection for text documents using named entity recognition. 2007 6th International Conference on Information, Communications & Signal Processing, (3), 1–5. https://doi.org/10.1109/ICICS.2007.4449883 Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141–188. https://doi.org/10.1613/jair.2934 Upadhyaya, S., & Singh, K. (2012). Classification based outlier detection techniques. Int J Comput Trends Technol, 3, 294–298. Retrieved from http://www.ijcttjournal.org/Volume3/issue-2/IJCTT-V3I2P118.pdf Wagner, A. (2000). Enriching a lexical semantic net with selectional preferences by means of statistical corpus analysis. Proceedings of ECAI Workshop on Ontology Learning and Population, 37–42. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Enriching+a+Lexical+Semantic+Net+with+Selectional+Preferences+by+Means+of+Statistical+Corpus+Analysis#0 Wang, Y., Ni, X., Sun, J.-T., Tong, Y., & Chen, Z. (2011). Representing document as dependency graph for document clustering. Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM ’11, 2177. https://doi.org/10.1145/2063576.2063920 Wehmeier, K. F. (2004). Wittgensteinian Predicate Logic. Notre Dame Journal of Formal Logic, 45(1), 1–11. https://doi.org/10.1305/ndjfl/1094155275 William Wei Song, Chenlu Lin, A. F. (2017). An Euclidean similarity measurement approach for hotel rating data analysis. Retrieved from https://ieeexplore.ieee.org/abstract/document/7951927/authors Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. WWW ’13 Proceedings of the 22nd International Conference on World Wide Web, 1445–1456. 
Retrieved from http://dl.acm.org/citation.cfm?id=2488388.2488514 Yang, Y., Zhang, J., Carbonell, J., & Jin, C. (2002). Topic-conditioned novelty detection. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’02, 688. https://doi.org/10.1145/775047.775150 Yih, W., & Meek, C. (n.d.). Improving Similarity Measures for Short Segments of Text, 1489–1494. Yin, J., & Wang, J. (2016). A Model-based Approach for Text Clustering with Outlier Detection. ICDE, 625–636. https://doi.org/10.1109/ICDE.2016.7498276 Yoo, J., & Yang, D. (2015). Classification Scheme of Unstructured Text Document using TF-IDF and Naive Bayes Classifier Text Classification using TF-IDF and Naïve Bayes Classifier, 111(Comcoms), 263–266. https://doi.org/10.14257/astl.2015.111.50 Yuhanis, S. S. kamaruddin and Y. (2015). Constructing canonical data model for text document clustering, 4. Zhang, D., Zhai, C., Han, J., Srivastava, A., & Oza, N. (2009). Topic modeling for OLAP on multidimensional text databases: Topic cube and its applications. Statistical Analysis and Data Mining, 2(5–6), 378–395. https://doi.org/10.1002/sam.10059 Zhang, W., Tang, X., & Yoshida, T. (2015). TESC: An approach to TExt classification using Semi-supervised Clustering. Knowledge-Based Systems, 75, 152–160. https://doi.org/10.1016/j.knosys.2014.11.028 Zhang, W., Xiao, F., Li, B., & Zhang, S. (2016). Using SVD on Clusters to Improve Precision of Interdocument Similarity Measure. Computational Intelligence and Neuroscience, 2016. https://doi.org/10.1155/2016/1096271 Zhang, Z. Z. Z., & Feng, X. F. X. (2009). New Methods for Deviation-Based Outlier Detection in Large Database. 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 1. https://doi.org/10.1109/FSKD.2009.303 Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing. 
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1556–1565. Zhou, Y., Fleischmann, K. R., & Wallace, W. A. (2010). Automatic text analysis of values in the enron email dataset: Clustering a social network using the value patterns of actors. Proceedings of the Annual Hawaii International Conference on System Sciences, 1–10. https://doi.org/10.1109/HICSS.2010.77 Zweig, M. H., & Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4), 561–577.