SDRS: a new lossless dimensionality reduction for text corpora

Iñaki Velez de Mendizabal; Vitor Manuel Basto Fernandes; Enaitz Ezpeleta; Jose R. Mendez; Urko Zurutuza

Scientific journal paper Q1

Iñaki Velez de Mendizabal (De Mendizabal, I. V.); Vitor Manuel Basto Fernandes (Basto-Fernandes, V.); Enaitz Ezpeleta (Ezpeleta, E.); Jose R. Mendez (Méndez, J. R.); Urko Zurutuza (Zurutuza, U.)

Journal Title

Information Processing and Management

Year (definitive publication)

2020

Language

English

Country

United Kingdom

Web of Science®

Times Cited: 8

(Last checked: 2026-06-24 21:24)

Article Impact Index: 0.2

Scopus

Times Cited: 8

(Last checked: 2026-06-19 12:18)

Article Impact Index: 0.1

Google Scholar

Times Cited: 9

(Last checked: 2026-06-24 19:20)

Overton

This publication is not indexed in Overton

Abstract

In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches that rely on token-based representations of textual content. Multiple performance enhancements have been introduced, but their impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different ways of taking advantage of semantic information to address problems such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single feature “anti-impotence drug”, “drug” or “chemical substance”, depending on whether the generalization spans 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and, in particular, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without dimensionality reduction. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of Naïve Bayes classifiers.
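
The hypernym-based grouping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `HYPERNYM` table is a hypothetical toy taxonomy that only mirrors the “Viagra”/“Cialis” example (a real system would query BabelNet), and the names `generalize`, `reduce_features` and `levels` are assumptions made for illustration. The per-synset `levels` mapping stands in for the kind of decision the abstract says is left to NSGA-II.

```python
from collections import Counter

# Hypothetical toy taxonomy: each synset maps to its direct hypernym.
# These entries only mirror the example given in the abstract.
HYPERNYM = {
    "Viagra": "anti-impotence drug",
    "Cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb up to `levels` hypernym steps, stopping early at a root."""
    for _ in range(levels):
        parent = HYPERNYM.get(synset)
        if parent is None:
            break
        synset = parent
    return synset


def reduce_features(counts, levels_per_synset):
    """Merge the counts of synsets whose generalized forms coincide."""
    reduced = Counter()
    for synset, count in counts.items():
        # Synsets without an entry in levels_per_synset are left as they are.
        target = generalize(synset, levels_per_synset.get(synset, 0))
        reduced[target] += count
    return reduced


# One document as synset counts; "meeting" has no entry in the toy taxonomy.
doc = Counter({"Viagra": 2, "Cialis": 1, "meeting": 3})

# Assumed candidate solution: generalize both drug names by one level.
levels = {"Viagra": 1, "Cialis": 1}
reduced = reduce_features(doc, levels)
print(dict(reduced))
```

With one level of generalization, the two drug features collapse into a single “anti-impotence drug” feature whose count is the sum of the originals, so the document goes from three features to two. A search procedure such as NSGA-II would explore different `levels` assignments and evaluate each one against its objectives.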

Keywords

Spam filtering, Token-based representation, Synset-based representation, Semantic-based feature reduction, Multi-objective evolutionary algorithms

Fields of Science and Technology Classification

Computer and Information Sciences - Natural Sciences

Funding Records

Funding Reference	Funding Entity
UIDP/04466/2020	Fundação para a Ciência e a Tecnologia
UIDB/04466/2020	Fundação para a Ciência e a Tecnologia

Publication Identifiers

Scopus (source: Ciência_Iscte)	2-s2.0-85081988881
DOI (source: author)	10.1016/j.ipm.2020.102249
WoS (source: Ciência_Iscte)	WOS:000531082800020
Scopus (source: author)	2-s2.0-85081988881
Ciência_Iscte ID	ci-pub-70824
Handle (source: Ciência-IUL)	http://hdl.handle.net/10071/20406

Other Publication Details

Online Publication Year	2020
Publisher	Elsevier
Indexes	Web of Science©; Scopus;
ISSN	0306-4573 (print) 1873-5371 (online)
Volume	57	Number	4
Series
Article Number	102249
Pages	--
Peer Reviewed	Yes
Publication Date (online)	2020-03-21
Publication Date (print)
