First Advisor

William Comer

Date of Award

Summer 2020

Document Type

Thesis

Degree Name

Bachelor of Science (B.S.) in Computer Science and University Honors

Department

Computer Science

Language

English

Subjects

Natural language processing (Computer science), Computational linguistics, Information storage and retrieval systems, Scholarly periodicals -- Russia -- Analysis

DOI

10.15760/honors.957

Abstract

The automatic extraction of keyphrases from scholarly papers is a necessary step for many Natural Language Processing (NLP) tasks, including text retrieval, machine translation, and text summarization. However, because languages differ in their grammatical and semantic intricacies, this is a highly language-dependent task. Many free and open source implementations of state-of-the-art keyphrase extraction techniques exist, but they are not adapted for processing Russian text. Furthermore, the multilingual character of scholarly papers in the field of Russian computational linguistics and NLP introduces additional complexity to keyphrase extraction. This paper describes a free and open source program as a proof of concept for a topic-clustering approach to the automatic extraction of keyphrases from papers presented at Dialogue, the largest conference on Russian computational linguistics and intellectual technologies. The goal of this paper is to use Latent Dirichlet Allocation (LDA) and pyLDAvis to discover the latent topics of the Dialogue conference and to extract the salient keyphrases used by the research community. The conclusion points to needed improvements in techniques for PDF text extraction, morphological normalization, and candidate keyphrase ranking.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

Comments

The code used in this paper is free and open source, and can be accessed through GitHub.

Persistent Identifier

https://archives.pdx.edu/ds/psu/33890

Recommended Citation

Wienecke, Yves, "Automatic Keyphrase Extraction From Russian-Language Scholarly Papers in Computational Linguistics" (2020). University Honors Theses. Paper 935.
https://doi.org/10.15760/honors.957

Included in

Computational Linguistics Commons, Computer Sciences Commons, Russian Linguistics Commons

University Honors Theses

Automatic Keyphrase Extraction From Russian-Language Scholarly Papers in Computational Linguistics
