Query-enrichment based methods start by enriching user queries into a collection of text documents through search engines: each query is represented by a pseudo-document consisting of the snippets of the top-ranked result pages retrieved by the search engine. The pseudo-documents are then classified into the target categories using synonym-based classifiers or statistical classifiers, such as Naive Bayes (NB) and Support Vector Machines (SVMs).
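The pipeline above can be sketched as follows. This is a minimal illustration, not the method of any particular paper: the labeled pseudo-documents, the snippets, and the category names are all invented, and a small multinomial Naive Bayes with add-one smoothing stands in for whichever statistical classifier is used.

```python
from collections import Counter, defaultdict
import math

# Hypothetical training data: each query is represented by a pseudo-document
# built from the snippets of its top-ranked search results (invented here).
train = [
    ("apple iphone review price specs", "Computers"),
    ("python tutorial programming code", "Computers"),
    ("stock market shares invest fund", "Finance"),
    ("bank loan interest rate mortgage", "Finance"),
]

def train_nb(data):
    """Multinomial Naive Bayes: per-class word counts with a shared vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        words = text.split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify_nb(model, pseudo_doc):
    """Pick the class maximising log P(class) + sum of log P(word | class),
    with add-one smoothing over the vocabulary."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in pseudo_doc.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(train)
# A new query, enriched by its (invented) result snippets:
print(classify_nb(model, "mortgage rate bank offers"))  # "Finance"
```

The same enriched representation could equally be fed to an SVM; the point is only that classification operates on the snippet pseudo-document rather than on the few words of the raw query.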
A further question is how to adapt to changes in the queries and categories over time. Old labeled training queries may soon become out-of-date and useless, so making the classifier adaptive over time becomes a major issue.
For example, the word "Barcelona" has recently acquired a new meaning, referring to AMD's new micro-processor, whereas before it referred to a city or a football club. The distribution of the meanings of this term on the Web is therefore a function of time.

The recently introduced graph CNNs can work on dense gridded structures as well as on generic graphs.
Graph CNNs have been performing on par with traditional CNNs on tasks such as point cloud classification and segmentation, protein classification, and image classification, while reducing the complexity of the network.
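One widely used graph convolution variant propagates node features through a normalised adjacency matrix; the sketch below shows a single such layer. This is a generic illustration (the surveyed graph CNNs may use different filters), and the toy graph and feature dimensions are invented.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One graph-convolution layer: normalise the adjacency (with self-loops),
    aggregate neighbour features, apply a linear map and a ReLU.
    adj: (n, n) adjacency, features: (n, d_in), weights: (d_in, d_out)."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                   # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)         # symmetric degree normalisation
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU

# Toy graph: three nodes on a path 0 - 1 - 2, random features and weights.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.random.randn(3, 4)
w = np.random.randn(4, 2)
out = gcn_layer(adj, x, w)
print(out.shape)  # (3, 2): one d_out-dimensional vector per node
```

On a regular grid the adjacency encodes the pixel neighbourhood, so the same operation specialises to a (spectrally motivated) convolution, which is why such layers subsume both dense gridded structures and generic graphs.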
Performance can be improved by integrating sequence information into the classification process. This approach has already been followed in the past, as some classification algorithms already integrate sequence information into their models, making them viable candidates for this task. CRFs tend to form the base of many existing meta-data extraction approaches [2, 3, 6]. Classification algorithms which do not natively support sequences can be enhanced with additional logic to improve the classification results.
This approach is followed by the meta-data extraction tools employed by the Mendeley Desktop and CiteSeer platforms. The TeamBeam algorithm uses a Maximum Entropy classifier [1], which is enhanced by Beam Search [8] for the sequence classification task. This combination has already been applied in the area of Natural Language Processing to label sequences of words. In terms of feature types, the TeamBeam algorithm follows existing approaches and integrates layout and formatting information, as well as employing common name lists.
It goes beyond existing approaches by creating language models [7] during the training phase. A demonstration of the algorithm can be accessed online (Figure 1).
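The idea of a per-field language model built during training can be sketched as follows. How TeamBeam represents its language models internally is not specified here; this is a simple unigram illustration with invented training data, where a model scores how familiar a block's words are for a given meta-data field.

```python
from collections import Counter, defaultdict

def build_language_models(labeled_blocks):
    """Build one unigram word-frequency model per meta-data field.
    labeled_blocks: list of (field_label, text) pairs from training data."""
    models = defaultdict(Counter)
    for label, text in labeled_blocks:
        models[label].update(text.lower().split())
    return models

def field_score(models, label, text):
    """Fraction of a block's words seen in that field's training vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    seen = sum(1 for w in words if models[label][w] > 0)
    return seen / len(words)

models = build_language_models([
    ("title", "A Study of Machine Learning"),
    ("abstract", "We present a study of learning methods"),
])
print(round(field_score(models, "title", "machine learning study"), 2))  # 1.0
```

Such a score can then be exposed to the classifier as an additional feature, alongside the layout and formatting features described below.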
The source of the algorithm is available under an open-source license and can be accessed via the TeamBeam web-page.

Figure 1: Example of a scientific article together with the output of the TeamBeam algorithm.
Meta-data has been extracted and annotated in the preview image of the article. Input: The starting point for meta-data extraction is a set of text blocks, provided by an open-source tool built upon the output of the PDFBox library (also accessible via the TeamBeam web-page). These blocks are generated by parsing scientific articles and organising the text into words, lines, and then text blocks. To identify these text blocks, layout and formatting information is exploited: a text block is a list of vertically adjacent lines which share the same font size.
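The block-building rule can be sketched as below. The line records, coordinates, and gap threshold are invented for illustration; the actual tool works on PDFBox output and will use different data structures.

```python
def build_text_blocks(lines, max_gap=2.0):
    """Group lines into text blocks: vertically adjacent lines sharing the
    same font size belong to one block.
    lines: list of dicts with 'text', 'y' (top coordinate, increasing down
    the page), 'font_size'; assumed sorted top-to-bottom."""
    blocks = []
    current = []
    for line in lines:
        same_font = current and line["font_size"] == current[-1]["font_size"]
        adjacent = (current and
                    line["y"] - current[-1]["y"]
                    <= current[-1]["font_size"] + max_gap)
        if same_font and adjacent:
            current.append(line)       # extend the current block
        else:
            if current:
                blocks.append(current)  # close the block and start a new one
            current = [line]
    if current:
        blocks.append(current)
    return [" ".join(l["text"] for l in blk) for blk in blocks]

lines = [
    {"text": "A Great Paper Title", "y": 0.0,  "font_size": 18.0},
    {"text": "Jane Doe",            "y": 30.0, "font_size": 10.0},
    {"text": "Example University",  "y": 41.0, "font_size": 10.0},
]
print(build_text_blocks(lines))
# ['A Great Paper Title', 'Jane Doe Example University']
```

The title line starts its own block because its font size differs, while the two author/affiliation lines merge because they are vertically adjacent and share a font size.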
In the first phase the input text blocks are classified and a sub-set of the meta-data is derived from this result. The text blocks related to the author meta-data are then fed into a second classification phase, in which the individual words within these text blocks are classified. Meta-Data Types: The goal of the TeamBeam algorithm is to extract a rich set of meta-data from scientific articles: (i) the title of the scientific article; (ii) the optional sub-title, which is only present for a fraction of all available articles; (iii) the name of the journal, conference, or venue; (iv) the abstract of the article, which might span a number of paragraphs; (v) the names of the authors; (vi) the e-mail addresses of the authors; (vii) the affiliation of the authors.
Classification Algorithm: A supervised machine learning algorithm lies at the core of the meta-data extraction process. The open-source library OpenNLP provides a set of classification algorithms tailored towards the classification of sequences.
Its main algorithm is based on the Maximum Entropy classifier [1], which by itself does not take sequence into account. In order to integrate the sequence information, a Beam Search approach [8] is followed.
Beam Search takes the classification decision of preceding instances into account to improve overall performance and to rule out any unlikely label sequences.
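A minimal sketch of this combination follows. The per-instance probabilities here are invented stand-ins for a trained Maximum Entropy model's output, and the label set and transition rules are hypothetical; the point is how Beam Search keeps only the k best partial label sequences and rules out unlikely transitions.

```python
import math

def beam_search(prob_fn, sequence, labels, k=3):
    """Label a sequence with Beam Search of width k.
    prob_fn(item, prev_label, label) -> P(label | item, prev_label),
    standing in for a per-instance (e.g. Maximum Entropy) classifier."""
    beams = [([], 0.0)]  # (label sequence so far, log-probability)
    for item in sequence:
        candidates = []
        for seq, lp in beams:
            prev = seq[-1] if seq else None
            for label in labels:
                p = prob_fn(item, prev, label)
                if p > 0:  # p == 0 rules the transition out entirely
                    candidates.append((seq + [label], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]  # keep only the k best partial sequences
    return beams[0][0]

# Toy transition model (invented): titles never follow the abstract, and
# repeated title blocks are heavily penalised.
def prob_fn(item, prev, label):
    base = {"Title": 0.5, "Author": 0.3, "Abstract": 0.2}[label]
    if prev == "Abstract" and label == "Title":
        return 0.0
    if prev == "Title" and label == "Title":
        return 0.05
    return base

print(beam_search(prob_fn, ["blk1", "blk2"], ["Title", "Author", "Abstract"]))
# ['Title', 'Author']
```

Even though "Title" is the most likely label for each block in isolation, the sequence-aware search avoids labelling both blocks as titles, which is exactly the effect described above.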
Feature Types: Classification algorithms are capable of dealing with a diverse set of feature types. The TeamBeam algorithm is restricted to binary features. Therefore, all continuous or categorical information needs to be mapped to features with binary values. The features used for classification are derived from the layout, the formatting, the words within and around a text block, and common name lists. A language model is created at the beginning of the training phase to improve the text block classification.
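The binarisation step can be sketched as follows. The feature names, thresholds, and the tiny common-name list are invented for illustration; TeamBeam's actual feature set is not reproduced here.

```python
def binarise(block):
    """Map a text block's mixed-type information to a set of binary features."""
    feats = set()
    # Categorical information: one binary feature per observed value.
    feats.add("font=" + block["font"])
    # Continuous information: thresholded into binary indicator features
    # (thresholds invented for illustration).
    if block["font_size"] >= 14.0:
        feats.add("large_font")
    if block["y"] < 100.0:
        feats.add("near_page_top")
    # Word-level indicator, e.g. membership in a common-name list.
    name_list = {"jane", "john", "doe"}
    if any(w.lower() in name_list for w in block["text"].split()):
        feats.add("contains_common_name")
    return feats

block = {"text": "Jane Doe", "font": "Times-Bold", "font_size": 10.0, "y": 80.0}
print(sorted(binarise(block)))
# ['contains_common_name', 'font=Times-Bold', 'near_page_top']
```

A feature is thus either present in the set (value 1) or absent (value 0), which is the only representation the classifier needs to handle.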
CNNs have helped improve performance on many difficult video and image understanding tasks, but are restricted to dense gridded structures. Meta-data plays an important role in providing services to retrieve and organise the articles. This is motivated by the need to separate two different affiliations which are written in a sequence.