DOIONLINE NO - IJAECS-IRAJ-DOIONLINE-5981

Published In
International Journal of Advances in Electronics and Computer Science-IJAECS
Issue
Volume-3, Issue-10 (Oct, 2016)
Paper Title
Unsupervised Query Result Extraction From Single Web Page
Author Name
Aleem Ansari, Hemlata Vasishtha
Affiliation
Shri Venkateshwara University, Gajraula, India
Pages
39-43
Abstract
This paper addresses the problem of extracting data from a Web page that contains contiguous structured or semi-structured data records, also referred to as object data (ODATA). The first objective is to identify the region containing the contiguous ODATA, also referred to as the data region or object region (OREG). Individual data items (fields) are then extracted from each ODATA and written to an XML file for further processing. Several researchers have studied this problem, but existing methods still have serious limitations: they are inaccurate, time-consuming, or rely on many assumptions. This paper proposes a novel technique that automates the retrieval of individual ODATA from a Web page. It consists of three steps: 1) predict the target OREG, 2) validate the OREG, and 3) identify and extract the attributes of each ODATA and write them to the XML file. This approach enables very accurate alignment and extraction of multiple ODATA. Experimental results on a large number of Web pages from diverse domains show that the proposed technique segments ODATA (data records) and aligns and extracts their data very accurately.
Index Terms: Data Record Extraction, Information Extraction, Web Content Mining, Semi-Structured Data.
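The three-step pipeline named in the abstract (predict the OREG, validate it, extract fields to XML) can be illustrated with a toy sketch. This is not the authors' algorithm, which the abstract does not specify. It substitutes a simple heuristic, assumed here for illustration only: the data region is the element with the most children sharing one tag. All names (`TreeBuilder`, `predict_region`, `validate_region`, `extract_records`, `extract`) and the `min_records` threshold are hypothetical, and the parser assumes well-nested HTML without void elements such as `<br>`.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []


class TreeBuilder(HTMLParser):
    """Builds a minimal tree; assumes well-nested HTML with no void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def predict_region(root):
    """Step 1 (assumed heuristic): node with the most same-tag children."""
    def score(node):
        counts = Counter(c.tag for c in node.children)
        return max(counts.values(), default=0)
    return max(walk(root), key=score)


def validate_region(region, min_records=2):
    """Step 2: accept the region only if its dominant child tag repeats."""
    counts = Counter(c.tag for c in region.children)
    if not counts:
        return None
    tag, n = counts.most_common(1)[0]
    return tag if n >= min_records else None


def extract_records(region, record_tag):
    """Step 3a: one list of text fields per repeated child element."""
    return [
        [" ".join(n.text) for n in walk(child) if n.text]
        for child in region.children
        if child.tag == record_tag
    ]


def to_xml(records):
    """Step 3b: serialize the records as an XML string."""
    root = ET.Element("records")
    for fields in records:
        rec = ET.SubElement(root, "record")
        for i, value in enumerate(fields, start=1):
            ET.SubElement(rec, "field", index=str(i)).text = value
    return ET.tostring(root, encoding="unicode")


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = predict_region(builder.root)
    record_tag = validate_region(region)
    if record_tag is None:
        return to_xml([])
    return to_xml(extract_records(region, record_tag))
```

Calling `extract` on a page whose results sit in a `<ul>` of `<li>` items yields one `<record>` element per item, with each text fragment in its own `<field>`. A real extractor would also need to align fields across records and cope with malformed markup, neither of which this sketch attempts.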