Skip to content
printicon
Show report in:

UMINF 09.12

Finding, Extracting and Exploiting Structure in Text and Hypertext

Data mining is a fast-developing field of study, using computations to either predict or describe large amounts of data. The increase in data produced each year goes hand in hand with this, requiring algorithms that are more and more efficient in order to find interesting information within a given time.

In this thesis, we study methods for extracting information from semi-structured data, for finding structure within large sets of discrete data, and to efficiently rank web pages in a topic-sensitive way.

The information extraction research focuses on support for keeping both documentation and source code up to date at the same time. Our approach to this problem is to embed parts of the documentation within strategic comments of the source code and then extracting them by using a specific tool.

The structures that our structure mining algorithms are able to find among crisp data (such as keywords) is in the form of subsumptions, i.e. one keyword is a more general form of the other. We can use these subsumptions to build larger structures in the form of hierarchies or lattices, since subsumptions are transitive. Our tool has been used mainly as input to data mining systems and for visualisation of data-sets.

The main part of the research has been on ranking web pages in a such a way that both the link structure between pages and also the content of each page matters. We have created a number of algorithms and compared them to other algorithms in use today. Our focus in these comparisons have been on convergence rate, algorithm stability and how relevant the answer sets from the algorithms are according to real-world users.

The research has focused on the development of efficient algorithms for gathering and handling large data-sets of discrete and textual data. A proposed system of tools is described, all operating on a common database containing "fingerprints" and meta-data about items. This data could be searched by various algorithms to increase its usefulness or to find the real data more efficiently.

All of the methods described handle data in a crisp manner, i.e. a word or a hyper-link either is or is not a part of a record or web page. This means that we can model their existence in a very efficient way. The methods and algorithms that we describe all make use of this fact.

Keywords

Automatic propagation, CHiC, ProT, S²ProT, data mining, discrete data, extraction, hierarchies, rank distribution, spatial linking, web mining, web searching

Authors

Ola Ågren

Back Edit this report
Entry responsible: Account Deleted - might not work

Page Responsible: Frank Drewes
2020-07-04