Please use this identifier to cite or link to this record: http://hdl.handle.net/10400.6/3714
Title: Classification of opinionated texts by analogy
Author: Pais, Sebastião
Advisor: Dias, Gaël Harry Adélio André
Keywords: Web pages -- Quality assessment
Web pages -- Content assessment
Natural language -- Processing
Opinion mining
Opinion retrieval
Sentiment analysis
Information retrieval -- Web -- Evaluation
Defense date: 2008
Abstract: With the disproportionate growth of the World Wide Web and of the quantity and availability of information services, documents of many kinds have accumulated excessively. Despite the positive aspects and the potential this represents, a new problem arises: we need capable tools and methodologies to classify a document according to its quality. Assessing the quality of a Web page is not easy. Many works have emerged on the technical evaluation of the structure of Web pages. This thesis follows a different course. It seeks to evaluate the content of pages according to the opinions and feelings they express. The basic criterion adopted to assess the quality of Web pages is the absence of opinions and feelings in their texts. When we consult information on the Web, how do we know that the information is reliable and does not express opinions or feelings made available to the public? How can we ensure, when we read a text, that we are not being misled by an author who is expressing his opinion or, once again, his feelings? How can we ensure that our own assessment is free from any value judgment that we might defend? Because of these questions, the area of "Opinion Mining", "Opinion Retrieval", or "Sentiment Analysis" is worth investigating, as we believe there is still much to discover. After extensive research and reading, we concluded that we do not want to follow the methodology proposed so far by other researchers. Basically, they work with manually annotated objective and subjective corpora. We consider this a disadvantage because these corpora are small and cover a limited number of subjects. We also disagree on another point: some researchers use only one or a few morphological classes, or specific words, as predefined attributes.
As we want to identify the degree of objectivity/subjectivity of sentences, and not documents, the more attributes we have, the more accurate we expect our classification to be. We also want to introduce another innovation in our method: making it as automatic as possible or, at least, as unsupervised as possible. Having identified some gaps in the area, we define our line of intervention for this dissertation. As already mentioned, the corpora used in the area of opinions are, as a rule, manually annotated and not very inclusive. To tackle this problem, we propose to replace these corpora with texts taken from Wikipedia and texts extracted from Weblogs, accessible to any researcher in the area. Thus, Wikipedia should represent objective texts and Weblogs subjective texts (which we can consider an opinion repository). These new corpora bring great advantages: they are obtained automatically, they are not manually annotated, they can be built at any time, and they are very inclusive. To support the claim that Wikipedia may represent objective texts and Weblogs subjective texts, we assess their similarity, at various morphological levels, with manually annotated objective/subjective corpora. To evaluate this similarity, we use two different methodologies, the Rocchio Method and the Language Model, on a cross-validation basis. The two methodologies yield similar results, which confirm our hypothesis. Following the success of this step, we propose to automatically classify sentences (at various morphological levels) by analogy. At this stage, we use different SVM classifiers, with training and test sets built over several corpora on a cross-validation basis, so as to have, once again, several results to compare when drawing our final conclusions.
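As an illustration only, the idea of comparing a text against an "objective" and a "subjective" collection can be sketched with a Rocchio-style centroid classifier. The sketch below is an assumption about the general shape of such a method, not the implementation used in the dissertation: it uses plain term frequencies and cosine similarity, the function names (`tokenize`, `centroid`, `cosine`, `classify`) are hypothetical, and the four example sentences are invented placeholders standing in for Wikipedia and Weblog texts.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def centroid(texts):
    """Average the length-normalised term-frequency vectors of the texts."""
    if not texts:
        return {}
    total = Counter()
    for text in texts:
        counts = Counter(tokenize(text))
        norm = math.sqrt(sum(v * v for v in counts.values()))
        if norm == 0:
            continue  # empty text contributes nothing
        for term, value in counts.items():
            total[term] += value / norm
    return {term: value / len(texts) for term, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(sentence, centroids):
    """Return the label whose centroid is most similar to the sentence."""
    vector = dict(Counter(tokenize(sentence)))
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Invented placeholder texts (assumption): stand-ins for Wikipedia and Weblogs.
objective_texts = [
    "The river flows through three countries before reaching the sea.",
    "The city was founded in 1850 and has a population of two million.",
]
subjective_texts = [
    "I think this movie is absolutely wonderful and I love it.",
    "I hate how awful and boring this phone is.",
]

centroids = {
    "objective": centroid(objective_texts),
    "subjective": centroid(subjective_texts),
}

print(classify("I love this wonderful book", centroids))
print(classify("The city has three rivers", centroids))
```

A realistic experiment would replace the placeholder lists with the actual corpora, use richer attributes (for example the morphological levels mentioned above), and evaluate on a cross-validation basis rather than on single sentences.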
This new concept of quality assessment of a Web page, through the absence of opinions, offers the scientific community another line of research in the area of opinions. Users in general also benefit: when consulting a Web page or using a search engine, they have the chance to know with some certainty whether the information is true or only a set of opinions/sentiments expressed by the authors, thereby excluding the authors' own value judgments from what the user sees.
URI: http://hdl.handle.net/10400.6/3714
Description: Dissertation presented to the Universidade da Beira Interior for the degree of Master in Informatics Engineering (Engenharia Informática)
Appears in collections: FE - DI | Master's Dissertations and Doctoral Theses

Files in this record:
capa_da_tese.pdf (137.64 kB, Adobe PDF)
tese.pdf (3.84 MB, Adobe PDF)

All records in the repository are protected by copyright law, with all rights reserved.