Paper Abstracts

Note that the final version of the IIR 2014 Proceedings will be released under CEUR after a post-conference editorial process coordinated by the Program Chairs, Fabio Crestani (University of Lugano) and Marco Pennacchiotti (eBay).

Evaluation of a Recursive Weighting Scheme for Federated Web Search

Emanuele Di Buccio, Ivano Masiero and Massimo Melucci

Abstract. The informative resources available on the Web are not always directly accessible and therefore cannot be crawled, since access is permitted only through appropriate services, e.g. specialized search engines. On the other hand, specialized search engines can help address the problem of heterogeneity of the informative resources due to the type of content, the structure or the media. Federated Web Search systems address the problem of searching multiple, heterogeneous, and possibly uncooperative collections. One issue of Federated Web Search is resource selection, i.e. the selection of the search engines most likely to provide documents relevant to the query. This paper reports on the experimental evaluation, in a Federated Web Search setting, of a recursive weighting scheme for ranking informative resources in architectures that involve an arbitrary number of resource levels.

The Axiometrics Project

Eddy Maddalena and Stefano Mizzaro

Abstract. The evaluation of retrieval effectiveness has played and is playing a central role in Information Retrieval (IR). To evaluate the effectiveness of IR systems, more than 50 (maybe 100) different evaluation metrics have been proposed. In this paper we sketch our Axiometrics project, which aims at a formal account of IR effectiveness metrics.

Can We Infer Book Classification by Blurbs?

Valentina Poggioni, Valentina Franzoni and Fabiana Zollo

Abstract. The aim of this work is to study the feasibility of an automated classification of books in the social network Zazie by means of the lexical analysis of book blurbs. A supervised learning approach is used to determine whether there is a correlation between the characteristics of a book blurb and the emotional icons associated with the book by Zazie's users.

Top-N Recommendations from Implicit Feedback leveraging Linked Open Data

Vito Claudio Ostuni, Tommaso Di Noia, Roberto Mirizzi and Eugenio Di Sciascio

Abstract. In this paper we present SPrank, a novel hybrid recommendation algorithm able to compute top-N item recommendations from implicit feedback by exploiting the information available in the so-called Web of Data. We leverage DBpedia, a well-known knowledge base in the LOD (Linked Open Data) compass, to extract semantic path-based features and eventually to compute recommendations using a learning-to-rank algorithm. Experiments with datasets on two different domains show that the proposed approach outperforms several state-of-the-art top-N recommendation algorithms for implicit feedback, in terms of prediction accuracy, in situations affected by different degrees of data sparsity.

Detection of Similar Terrorist Events

Vittoria Cozza and Michelangelo Rubino

Abstract. Event counting is significant when it allows us to discover and represent implicit knowledge. When a particular event happens somewhere, it is often not by mere chance: it is unlikely to be what we would call an accidental event. For example, the number of violent attacks and terrorist acts can give a measure of the safety of a given country and can help us to predict where and/or when similar events are likely to happen next. This work proposes an approach for detecting terrorist events sharing common details, available from open datasets, with the aim of merging their descriptions and counting them exactly. Events are aggregated according to a space-time-textual similarity function.
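
As a rough illustration of what a space-time-textual similarity function might look like, the sketch below combines three normalized scores: geographic closeness, temporal closeness and word overlap. The abstract does not specify the authors' actual function, so everything here is an assumption made for illustration: the function names, the record fields (`lat`, `lon`, `when`, `text`), the cut-offs `max_km` and `max_days`, the equal weights, and the choice of haversine distance and Jaccard overlap.

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def jaccard(text_a, text_b):
    """Jaccard overlap between the lowercase word sets of two descriptions."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def event_similarity(e1, e2, max_km=50.0, max_days=2.0, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of spatial, temporal and textual similarity, each in [0, 1].

    The thresholds and weights are hypothetical, not taken from the paper.
    """
    dist = haversine_km(e1["lat"], e1["lon"], e2["lat"], e2["lon"])
    space = max(0.0, 1.0 - dist / max_km)        # 1 at the same place, 0 beyond max_km
    days = abs((e1["when"] - e2["when"]).total_seconds()) / 86400.0
    time = max(0.0, 1.0 - days / max_days)       # 1 at the same instant, 0 beyond max_days
    text = jaccard(e1["text"], e2["text"])
    w_space, w_time, w_text = weights
    return w_space * space + w_time * time + w_text * text


# Two made-up reports that may describe the same event.
report_a = {"lat": 33.31, "lon": 44.36, "when": datetime(2013, 5, 1, 9, 0),
            "text": "car bomb explodes near market"}
report_b = {"lat": 33.32, "lon": 44.37, "when": datetime(2013, 5, 1, 11, 0),
            "text": "bomb explodes near crowded market"}
print(event_similarity(report_a, report_b))
```

Under this kind of scheme, pairs scoring above some chosen threshold would be treated as descriptions of the same event and merged before counting; the threshold itself would have to be tuned on data.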

Developing a Semantic Content Analyzer for L’Aquila Social Urban Network

Cataldo Musto, Giovanni Semeraro, Pasquale Lops, Marco de Gemmis, Fedelucio Narducci, Luciana Bordoni, Mauro Annunziato, Claudia Meloni, Franco F. Orsucci and Giulia Paoloni

Abstract. This paper presents the preliminary results of a joint research project about Smart Cities. The project adopts a multidisciplinary approach that combines artificial intelligence techniques with psychology research to monitor the current state of the city of L'Aquila after the dreadful earthquake of April 2009. This work focuses on the description of a semantic content analysis module. This component, integrated into L'Aquila Social Urban Network (SUN), combines Natural Language Processing (NLP) and Artificial Intelligence (AI) to deeply analyze the content produced by citizens on social platforms in order to map social data to social indicators such as cohesion, sense of belonging and so on. The research builds on the insight that social data can supply a lot of information about people's latent feelings, opinions and sentiments. Within the project, this trustworthy snapshot of the city is used by community promoters to proactively propose initiatives aimed at empowering the social capital of the city and recovering the urban structure, which was disrupted after the "diaspora" of citizens to the so-called "new towns".

Sentiment Estimation on Twitter

Giambattista Amati, Marco Bianchi and Giuseppe Marcone

Abstract. We study the classifier quantification problem in the context of topical opinion retrieval, which consists in estimating the proportions of the sentiment categories in the result set of a topic. We propose a methodology that circumvents individual classification, allowing real-time sentiment analysis for huge volumes of data. After discussing existing approaches to quantification, the novel proposed methodology is applied to Microblogging Retrieval and provides statistically significant estimates of sentiment category proportions. Our solution modifies Hopkins and King's approach in order to remove manual intervention and make sentiment analysis feasible in real time. Evaluation is conducted with a test collection made up of about 3.2M tweets.

Enabling Enterprise Semantic Search through Language Technologies: the Progress-It experience

Roberto Basili, Andrea Ciapetti, Danilo Croce, Valeria Marino, Paolo Salvatore and Valerio Storch

Abstract. This paper presents the platform targeted in the PROGRESS-IT project. It is an Enterprise Semantic Search engine tailored for Small and Medium Sized Enterprises to retrieve information about Projects, Grants, Patents or Scientific Papers. The proposed solution improves the usability and quality of standard search engines through Distributional models of Lexical Semantics. The quality of the Keyword Search has been improved with Query Suggestion, Expansion and Result Re-Ranking. Moreover, the interaction with the system has been specialized for analysts by defining a set of Dashboards designed to enable richer queries while avoiding the complexity of their definition. This paper shows the application of Linguistic Technologies, such as the Structured Semantic Similarity function, to measure the relatedness between documents. These are then used in the retrieval process, for example to ask the system for Project Ideas directly using an Organization Description as a query. The resulting system is based on Solr, inheriting its high reliability, scalability and fault tolerance, and providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more.

Exploiting Wikipedia to Identify Domain-Specific Key Terms/Phrases from a Short-Text Collection

Muhammad Atif Qureshi, Colm O’Riordan and Gabriella Pasi

Abstract. Extracting what we call "domain-specific" key terms/phrases from a given document collection is a challenging task. By "domain-specific" key terms/phrases we mean words/expressions representative of the topical areas specific to the focus of a document collection. For example, when a collection is related to academic research (i.e., its focus is related to topics dealing with academic research), the domain-specific key terms/phrases could be 'Information Retrieval', 'Marine Biology', 'Science', etc. In this contribution a technique for identifying domain-specific key terms/phrases from a collection of documents is proposed. The proposed technique works on short textual descriptions, and it makes use of the titles of Wikipedia articles and of the Wikipedia category graph. We performed some experiments over the document collection (HTML title text only) of eight post-graduate school Web sites from five different countries. The evaluations show promising results for the identification of domain-specific key terms/phrases.

On the Effects of Low-Quality Training Data on Information Extraction from Clinical Reports

Diego Marcheggiani and Fabrizio Sebastiani

Abstract. In the last five years there has been a flurry of work on information extraction from clinical documents, i.e., on algorithms capable of extracting, from the informal and unstructured texts that are generated during everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature is about methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate more and more accurate information extractors, little work (if any) has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data sometimes derives from the fact that the person who has annotated the data is different (e.g., more junior) from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems oriented to the clinical domain. We do this by comparing the accuracy deriving from training data annotated by the authoritative coder (i.e., the one who has annotated the test data) with the accuracy deriving from training data annotated by a different coder. The results indicate that, although the disagreement between the two coders (as measured on the training set) is substantial, the difference in accuracy is not. This hints at the fact that current learning technology is robust to the use of training data of suboptimal quality.

LearNext: Learning to Predict Tourists Movements

Ranieri Baraglia, Cristina Ioana Muntean, Franco Maria Nardini and Fabrizio Silvestri

Abstract. In this paper, we tackle the problem of predicting the "next" geographical position of a tourist given her history (i.e., the prediction is done according to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. The learning is done on the basis of an object space represented by a 68-dimensional feature vector, specifically designed for tourism-related data. Furthermore, we propose a thorough comparison of several methods that are considered state-of-the-art in touristic recommender and trail prediction systems, as well as a strong popularity baseline. Experiments show that the methods we propose outperform important competitors and baselines, thus providing strong evidence of the performance of our solutions.

An Investigation into the Correlation between Willingness for Web Search Personalization and SNS Usage Patterns

Arjumand Younus, Colm O’Riordan and Gabriella Pasi

Abstract. This paper presents a user survey-based analysis of the correlation between the users' willingness to personalize Web search and their social network usage patterns. The participants' responses to the survey questions enabled us to use a regression model for identifying the relationship between SNS variables and willingness to personalize Web search; the obtained results show that there is a strong relationship between willingness for personalized Web search and social network usage patterns. Finally, based on the findings of our survey we present some implications and directions for future work.