Performance Analysis of Vision-Based Deep Web Data Extraction for Web Document Clustering


M. Lavanya and Usha Rani

Web data extraction is a critical task that draws on a variety of scientific tools and serves a broad range of application domains. Extracting data from multiple web sites is becoming increasingly difficult, and designing web information extraction systems is becoming more complex and time-consuming. This paper also presents the various risks involved in web data extraction. Identifying the data region of a web page is a significant problem for information extraction. In this paper, the performance of vision-based deep web data extraction for web document clustering is presented with experimental results. The proposed approach comprises two phases: 1) vision-based web data extraction and 2) web document clustering, where the output of phase 1 is the input to phase 2. In phase 1, the web page is segmented into chunks, from which noise and duplicate chunks are removed using three parameters: hyperlink percentage, noise score, and cosine similarity. To identify the relevant chunks, three further parameters are used: title word relevancy, keyword frequency-based chunk selection, and position features. A set of keywords is then extracted from these main chunks. Finally, the extracted keywords are subjected to web document clustering using fuzzy c-means clustering (FCM). Experiments were performed on two different datasets, and the results show that the proposed VDEC method achieves stable and good results, with precision values of about 99.2% and 99.1% on the two datasets.
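
The chunk-filtering step of phase 1 can be illustrated with a minimal Python sketch. The abstract does not define the parameters, so the sketch assumes that hyperlink percentage is the share of a chunk's words that appear inside links and that duplicates are detected by cosine similarity over term-frequency vectors. The noise score is not sketched because its definition is not given here. The chunk layout (`text`, `link_texts`), the function names, and both thresholds are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hyperlink_percentage(chunk):
    """Share of a chunk's words that sit inside hyperlinks (assumed definition)."""
    total = len(tokenize(chunk["text"]))
    linked = sum(len(tokenize(anchor)) for anchor in chunk["link_texts"])
    # An empty chunk carries no content, so it is treated as pure noise.
    return linked / total if total else 1.0


def filter_chunks(chunks, max_link_pct=0.5, dup_threshold=0.9):
    """Drop link-heavy chunks, then near-duplicates of chunks already kept."""
    kept = []
    for chunk in chunks:
        if hyperlink_percentage(chunk) > max_link_pct:
            continue  # mostly navigation links
        if any(cosine_similarity(chunk["text"], k["text"]) >= dup_threshold
               for k in kept):
            continue  # near-duplicate of an earlier chunk
        kept.append(chunk)
    return kept


# Hypothetical chunks: a navigation bar, a repeated passage, and a content chunk.
chunks = [
    {"text": "Home About Contact Login",
     "link_texts": ["Home", "About", "Contact", "Login"]},
    {"text": "Fuzzy c-means clustering groups web documents by keyword",
     "link_texts": []},
    {"text": "Fuzzy c-means clustering groups web documents by keyword",
     "link_texts": []},
    {"text": "Vision-based segmentation splits a page into chunks",
     "link_texts": ["chunks"]},
]
main_chunks = filter_chunks(chunks)
```

In this example the navigation bar is dropped for its link density and the repeated passage is dropped as a duplicate, leaving two chunks. The thresholds are illustrative and would need tuning on real pages.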

Keywords: Features, risks, problems, VDEC, Framework, Position features, Fuzzy c-means clustering (FCM)
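
For readers unfamiliar with fuzzy c-means, the following is a minimal sketch of the standard algorithm in plain Python, not the authors' implementation. It assumes each document is represented as a numeric keyword-frequency vector. The function name, the fuzzifier `m = 2`, the stopping tolerance, and the example vectors are all illustrative assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    exp = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            weights = [u[i][j] ** m for i in range(len(points))]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
        # Membership update from the relative distances to the centers.
        shift = 0.0
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point coincides with one or more centers: share its
                # membership equally among those centers.
                zero = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_row = [z / sum(zero) for z in zero]
            else:
                new_row = [
                    1.0 / sum((dj / dk) ** exp for dk in dists)
                    for dj in dists
                ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:
            break  # memberships have stabilized
    return centers, u


# Hypothetical keyword-frequency vectors for four documents.
docs = [[5, 4, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]]
centers, memberships = fuzzy_c_means(docs, n_clusters=2)
# Hard label per document: the cluster with the highest membership.
labels = [row.index(max(row)) for row in memberships]
```

Unlike hard clustering, each document receives a degree of membership in every cluster, and each row of `memberships` sums to 1. The hard labels are derived only at the end by taking the largest membership.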



ABOUT THE AUTHORS

M. Lavanya
Assistant Professor [SL], Department of Master of Computer Applications

Usha Rani
Associate Professor, Department of Computer Science


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
