Effective query structuring with ranking using named entity categories for XML retrieval
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2016 |
Subjects: | |
Online Access: | http://psasir.upm.edu.my/id/eprint/69351/1/FSKTM%202016%2018%20IR.pdf |
Summary: | A large number of documents are now represented and stored using an XML document
structure on the web. Thus, there is a need for effective and user-friendly search
systems for XML document search. Query languages are widely used to compose
structured queries that extract data from XML documents. However, expressing
queries in a query language proves difficult for most users, since it requires
learning the language and knowing the underlying data schema. On the other hand,
the success of Web search engines has made many users familiar with keyword
search, so they prefer a keyword search interface for searching XML data. Keyword
queries are inherently ambiguous, and it is difficult for users to state their
intentions clearly, so keyword search systems inevitably return irrelevant results,
which makes them less effective. More effective keyword search systems are
therefore needed.
Query structuring systems are among the keyword search systems recently used for
effective retrieval of XML documents. These systems focus on user query
representation, user search intention identification and ranking algorithms to improve
keyword search. However, existing systems have several shortcomings. Firstly, they
produce incorrect query representations because they do not take keyword query
ambiguity into account during query pre-processing. For example, none of them
considers the following ambiguities: (i) a query term can appear as the text value
of different XML nodes with different semantics; (ii) a query term can appear both
as a tag name and as part of the text content of some node. Secondly, the systems
misidentify the user's search intention. Specifically, they return irrelevant
predicates as well as non-informative entity nodes. Thirdly, the systems fail to
generate and select the best structured query that matches a user's keyword query.
Finally, the systems' ranking functions ignore the semantics of XML tags, which
leads to irrelevant results. These problems are addressed as follows.
Firstly, an enrichment method has been proposed to investigate whether enriching
document content with semantic tags improves the performance of keyword queries.
The method employs a Semantic Tags Extraction (STSE) algorithm to extract the
semantic tags of an element and an Element Enrichment (EERM) algorithm to enrich
the elements.
Secondly, an XML Keyword Query Structuring System (XKQSS) has been developed
to take over the task of generating structured queries from the user while retaining
the simple keyword search interface that allows users to submit a
schema-independent keyword query. XKQSS uses a Semantic Aware Index scheme
(SAIS) to record the proportion of Named Entity Categories (NECs) and an Entity
based Query Segmentation (EBQS) method to interpret the user query as a list of
keywords and named entities, which resolves ambiguity. Furthermore, it employs a
Predicates Identification Algorithm (PIA) and an Entity Identification Algorithm
(EIA) to identify the user's search intention. Finally, the system uses a query
formulation algorithm (QRYF) to select the structured queries that best interpret
the user query.
Thirdly, a modification of XKQSS called the Ranking Aware XML Keyword Query
Structuring System (RAXKQSS) has been developed to return a ranked
list of elements as the answer to a user query. RAXKQSS first introduces an
improved SAIS (ISAIS) that records, in the inverted index, the Named Entity
Category (NEC) of each indexed term in addition to the usual information such as
term frequency, term positions and the element that contains the term. Then, the
system uses a ranking function, rk_BM25TOPF, to assign relevance scores to XML
fragments with respect to a query, and an N-gram based Query Segmentation (NBQS)
method to interpret the user query as a list of N-grams, which resolves ambiguity.
Next, it introduces an Improved PIA (IPIA) and a Compute Return Node Algorithm
(CRNA) to return relevant predicates and the return node, respectively. Finally,
the system employs a query formulation via node (QRYFv) algorithm to improve the
selection of structured queries that best match the user query.
Experiments have been conducted to evaluate the performance of the proposed
enrichment method, XKQSS and RAXKQSS. The experimental results show
that the enrichment method yields only an insignificant improvement over the
baseline in terms of Mean Average Precision (MAP). The results also demonstrate
that the proposed XKQSS outperforms XReal and StruX in terms of precision.
Moreover, the proposed RAXKQSS achieves higher precision than StruX and SLCA.
These results show that the enrichment method is ineffective in improving
retrieval performance, while the proposed systems XKQSS and RAXKQSS are
effective compared with StruX and SLCA in terms of retrieval
performance. |
---|
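
The summary describes ISAIS as an inverted index that stores the Named Entity Category of each indexed term together with term frequency, term positions and the containing element. The Python sketch below illustrates one way such an index could be laid out. It is not the thesis implementation: the names `Posting`, `TAG_TO_NEC` and `build_index`, the toy XML, and the idea of deriving the category from the tag name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    element_path: str   # path of the element whose text contains the term
    nec: str            # Named Entity Category assigned to the term in that element
    positions: list = field(default_factory=list)  # token offsets in the element text

    @property
    def term_frequency(self):
        return len(self.positions)


# Hypothetical tag-to-category lookup (an assumption of this sketch;
# the thesis does not state how categories are assigned).
TAG_TO_NEC = {"author": "PERSON", "title": "WORK", "city": "LOCATION"}


def build_index(xml_text):
    """Return {term: [Posting, ...]} built from the direct text of every element."""
    index = defaultdict(list)
    root = ET.fromstring(xml_text)

    def walk(node, path):
        here = f"{path}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            nec = TAG_TO_NEC.get(node.tag, "OTHER")
            by_term = {}  # one posting per distinct term in this element
            for pos, token in enumerate(text.lower().split()):
                posting = by_term.get(token)
                if posting is None:
                    posting = by_term[token] = Posting(here, nec)
                    index[token].append(posting)
                posting.positions.append(pos)
        for child in node:
            walk(child, here)

    walk(root, "")
    return index


SAMPLE = """<library>
  <book><title>Paris in spring</title><author>Paris Smith</author></book>
</library>"""

index = build_index(SAMPLE)
for p in index["paris"]:
    print(p.element_path, p.nec, p.term_frequency, p.positions)
```

In this toy document the term "paris" receives two postings, one under `title` and one under `author`, each with a different category. This mirrors ambiguity (i) in the summary, where one query term appears as the text value of different XML nodes with different semantics. Sibling elements with the same tag share a path here, so a fuller index would need a unique element identifier.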