Machine Learning for Web Page Classification: A Survey

By Safae Lassri, El Habib Benlahmar, Abderrahim Tragha

Abstract


The Internet contains a vast amount of data that is growing exponentially. Exploiting this data requires a Web information retrieval system and a categorization of Internet content based on the classification of web pages. Web page classification has many applications, including the construction of web directories and the building of focused crawlers. In this paper, we present the characteristics of web page classification, provide a literature review that summarizes and evaluates all sources related to web page classification crawled automatically from the ScienceDirect and Springer websites, and review the different machine learning algorithms used to categorize web pages. Finally, we trace the underlying assumptions behind the methods studied.

