Theses Supervised by Ilyas Cicekli
Mustafa Cataltaş
Data Augmentation for Natural Language Processing
M.S. Thesis, June 2024
ABSTRACT: Advanced deep learning
models have greatly improved various natural language processing tasks. While
they perform best with abundant data, acquiring large datasets for each task is
not always easy. Data augmentation techniques address this by creating synthetic
samples from existing data to build more comprehensive datasets. This thesis
examines the efficacy of autoencoders as a textual data augmentation technique
for enhancing the performance of classification models in text classification
tasks. The analysis
encompasses the comparison of four distinct autoencoder types: Traditional
Autoencoder (AE), Variational Autoencoder (VAE), Adversarial Autoencoder (AAE)
and Denoising Adversarial Autoencoder (DAAE). Moreover, the study investigates
the impact of different word embedding types, preprocessing methods,
label-based filtering, and the number of training epochs on the performance of
autoencoders. Experimental evaluations are conducted using the SST-2 sentiment
classification dataset, consisting of 7791 training instances. For data
augmentation experiments, subsets of 100, 200, 400, and 1000 randomly selected
instances from this dataset were employed. Experimental evaluations involved
augmenting data at ratios of 1:1, 1:2, 1:4, and 1:8 when working with small
datasets. Comparative analysis with baseline models demonstrates the
superiority of AE-based data augmentation methods at a 1:1 augmentation ratio. These findings
underscore the effectiveness of using autoencoders as data augmentation methods
for optimizing text classification performance in NLP applications.
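As a rough illustration of the augmentation idea described in this abstract, the sketch below generates synthetic labeled samples by perturbing autoencoder latent vectors and decoding them, e.g. to reach a 1:1 augmentation ratio. The tiny architecture, noise scale, and use of pre-computed sentence embeddings are assumptions for the example, not the thesis configuration.

```python
# A minimal sketch (illustrative, not the thesis code): once an autoencoder is
# trained on sentence embeddings, synthetic samples can be produced by adding
# small noise to the latent vectors and decoding them; labels are carried over.
import torch
import torch.nn as nn

class TinySentenceAE(nn.Module):
    def __init__(self, dim=256, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def augment(model, embeddings, labels, ratio=1, noise_scale=0.05):
    """Generate `ratio` synthetic embeddings per original (e.g. ratio=1 for 1:1)."""
    model.eval()
    synthetic, synthetic_labels = [], []
    with torch.no_grad():
        for _ in range(ratio):
            _, z = model(embeddings)
            z_noisy = z + noise_scale * torch.randn_like(z)   # jitter the latent space
            synthetic.append(model.decoder(z_noisy))          # decode synthetic points
            synthetic_labels.append(labels)
    return torch.cat(synthetic), torch.cat(synthetic_labels)

# toy usage: 100 original instances, augmented at a 1:1 ratio
model = TinySentenceAE()
x, y = torch.randn(100, 256), torch.randint(0, 2, (100,))
aug_x, aug_y = augment(model, x, y, ratio=1)
```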
Seçkin Şen
Weakly-Supervised Relation Extraction
M.S. Thesis, September 2023
ABSTRACT: Relation extraction is
crucial for many natural language processing applications, such as question
answering and text summarization. There are several different approaches to
relation extraction, and most of them use the supervised learning approach,
which requires a large training dataset. These extensive datasets must be
hand-labeled by experts, making the annotation process time-consuming and
expensive. The approach utilized in this thesis is weakly supervised relation
extraction, which can reduce the cost of labeling training data. In this thesis, we propose a weakly
supervised relation extraction approach that is inspired by another weakly
supervised model named REPEL. In both REPEL and our relation extraction
approach, extraction patterns are derived from unlabeled texts using given
relation seed examples. In order to extract more useful extraction patterns, we
introduce the use of labeling functions in our method. These labeling functions
consist of simple rules that analyze the syntax of candidate patterns and help
to extract candidate patterns with higher confidence. Our
proposed method is tested on the same dataset used by REPEL in order to compare
our results with the results obtained by REPEL. Tests are conducted both in
English and Turkish. Both systems require a number of relation seed examples in
order to learn patterns from the unlabeled data.
When fewer relation seed examples are used, our method outperforms REPEL
significantly. In experimental tests, our approach generally gives better
results than REPEL for both languages. In the English tests, it is
approximately 15 times more successful than REPEL when only a few relation
seeds are used. Even with more relation seeds, our approach remains more successful.
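The sketch below illustrates the labeling-function idea from this abstract; the rules, candidate patterns, and threshold are invented for the example and are not the thesis's actual rules. Each simple syntactic rule votes on a candidate extraction pattern, and the votes are aggregated into a confidence score used to keep or discard the pattern.

```python
# Toy labeling functions voting on candidate extraction patterns (illustrative only).
import re

def lf_contains_relation_verb(pattern: str) -> int:
    return 1 if re.search(r"\b(is|was|founded|located|born)\b", pattern) else 0

def lf_not_too_long(pattern: str) -> int:
    return 1 if len(pattern.split()) <= 6 else -1

def lf_no_pronoun(pattern: str) -> int:
    return -1 if re.search(r"\b(he|she|it|they|him|her|them)\b", pattern) else 0

LABELING_FUNCTIONS = [lf_contains_relation_verb, lf_not_too_long, lf_no_pronoun]

def pattern_confidence(pattern: str) -> float:
    votes = [lf(pattern) for lf in LABELING_FUNCTIONS]
    return sum(votes) / len(votes)

candidates = ["X was born in Y", "X , who according to people close to him , visited Y"]
confident = [p for p in candidates if pattern_confidence(p) > 0]
print(confident)   # only the syntactically plausible pattern survives
```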
Mahmoud Hossein Zadeh
Behavior Analysis in Social Media
Ph.D.
Thesis, June 2023
ABSTRACT: Purpose: Social unrest is a phenomenon that occurs in all countries,
both developed and poor. The only difference is in the causes of this unrest,
which are mostly economic in underdeveloped countries. The occurrence of
protests and the role of social networks in them have always been debated
topics among researchers. Protest
Event Analysis is important for government officials and social scientists.
Here we present a new method for predicting protest events and identifying
indicators of protests and violence by monitoring the content generated on
Twitter.
Methods: By identifying these indicators, protests
and the possibility of violence can be predicted and controlled more
accurately. Twitter user behaviors such as opinion sharing and event-log
sharing are used as indicators, and this study presents a new method based on a
Bayesian logistic regression algorithm for predicting protests and violence
from these behaviors. According to the proposed method, users' event-log
sharing behavior, measured as the rate of tweets containing date and time
information, is a reliable indicator for identifying protests. Users' opinion
sharing behavior, measured as the rate of hate and anger tweets, is the best
indicator for identifying violence during protests.
Results: The research database consists of tweets generated about the BLM
(Black Lives Matter) movement after the death of George Floyd. According to
information published on acleddata.com, protests and
violence have been reported in various cities on specific dates. The dataset
contains 1414 protest events and 3078 non-protest events from 460 cities in 37
U.S. states. Protest events include 1414 protests in the BLM movement between May
28 and June 30, among which 285 were violent and 1129 were peaceful. We tested
our proposed method on this dataset: the occurrence of protests is predicted
with 85% precision, and violence in protests is likewise predicted with 85%
precision.
Conclusion: According to the research findings, the
behavior of users on the Twitter social network is a reliable source for
predicting incidents and violence. Unlike the existing literature, which
focuses on large-scale protests, this study provides a successful method for
predicting both small- and large-scale protests.
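A toy sketch of the prediction setup described in this abstract is shown below; scikit-learn's plain logistic regression is a stand-in for the thesis's Bayesian logistic regression, and the feature values and labels are invented. Each row represents a city/day described by the event-log share rate and the hate-anger opinion share rate of its tweets.

```python
# Stand-in logistic regression over invented protest-indicator features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.31, 0.12],    # [rate of tweets with date/time info, hate-anger rate]
              [0.04, 0.02],
              [0.27, 0.25],
              [0.02, 0.01]])
y = np.array([1, 0, 1, 0])     # 1 = protest reported, 0 = no protest

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0.30, 0.20]])[0, 1])   # estimated probability of a protest
```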
Kadir Yalçın
A Plagiarism Detection System Based on Document
Similarity
Ph.D.
Thesis, September 2022
ABSTRACT: It is a common problem to
find similar parts in two different documents or texts. In particular, a text
suspected of plagiarism is likely to have characteristics similar to those of
the source text. Plagiarism is defined as taking some or all of the writings of
other people and presenting them as one's own, or expressing the ideas of
others in different ways without citing the source. Plagiarism cases have been
increasing with the development of technology.
Therefore, in order to prevent plagiarism, various plagiarism detection
programs have been used in universities and principles regarding plagiarism and
scientific ethics have been added to education regulations.
In this thesis, a novel method for detecting
external plagiarism is proposed. Both syntactic and semantic similarity
features are used to identify the plagiarized parts of the text.
Part-of-speech (POS) tags are used to detect plagiarized sections in the
suspicious texts and their corresponding sections from the source texts. Each
source sentence is indexed by a search engine according to its POS tag n-grams
to access possible plagiarism candidate sentences rapidly. Suspicious sentences
converted to their POS tag n-grams are used as queries to access source
sentences. The search engine results returned for these queries enable the
detection of plagiarized parts of the suspicious document. The semantic relationship between
words is measured with the word embedding technique called Word2Vec, and the
longest common subsequence (LCS) approach is applied to calculate the semantic
similarity between the source and suspicious sentences.
In this thesis, the PAN-PC-11 dataset, which was created for the evaluation of
automatic plagiarism detection algorithms, is used. The tests are carried out
with different parameters and threshold values to evaluate the diversity of the
results. In the experiments performed on the PAN-PC-11
dataset, the proposed method achieved the best performance in low and high
obfuscation plagiarism cases compared to the plagiarism detection systems in
the 3rd International Plagiarism Detection Competition (PAN11).
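The sketch below illustrates the similarity-aware LCS idea described in this abstract; the cosine threshold and the toy vectors are assumptions, not the thesis parameters. Two words "match" when their Word2Vec-style vectors are similar enough, and the normalized LCS length serves as a sentence-level semantic similarity score.

```python
# Similarity-aware longest common subsequence over word vectors (illustrative sketch).
import numpy as np

def lcs_semantic(src_tokens, sus_tokens, vec, threshold=0.7):
    def sim(a, b):
        va, vb = vec[a], vec[b]
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    n, m = len(src_tokens), len(sus_tokens)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sim(src_tokens[i - 1], sus_tokens[j - 1]) >= threshold:
                table[i][j] = table[i - 1][j - 1] + 1       # semantic match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m] / max(n, m)                          # normalized to [0, 1]

# toy vectors stand in for Word2Vec embeddings
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=50) for w in ["the", "a", "cat", "sat", "down"]}
print(lcs_semantic(["the", "cat", "sat"], ["a", "cat", "sat", "down"], vec))
```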
Ömer Şahin
Evaluating the Use of Neural Ranking Methods in Search
Engines
M.S. Thesis, January 2022
ABSTRACT: A search engine strikes a
balance between effectiveness and efficiency to retrieve the best documents in
a scalable way. Recent deep learning-based ranker methods prove effective and
improve state of the art in relevancy metrics. However, unlike index-based
retrieval methods, neural rankers like BERT do not scale to large datasets. In
this thesis, we propose a query term weighting method that can be used with a
standard inverted index without modifying it. Using a pairwise ranking loss,
query term weights are learned using relevant and irrelevant document pairs for
each query. The learned weights prove to be more effective than term recall
values previously used for the task. We further show that these weights can be
predicted with a BERT regression model and improve the performance of both a
BM25 based index and an index already optimized with a term weighting function.
In addition, we examine document term weighting methods in the literature that
work by manipulating term frequencies or expanding documents for document
retrieval tasks. Predicting weights for document terms with the help of
contextual knowledge about the document, instead of using raw term frequencies,
significantly increases retrieval and ranking performance.
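The sketch below illustrates the pairwise-ranking idea from this abstract; the toy term frequencies, hinge loss form, and hyperparameters are illustrative, not the thesis's actual training setup. The query term weights are adjusted so that a weighted term-match score ranks a relevant document above an irrelevant one for the same query.

```python
# Learning query term weights with a toy pairwise hinge ranking loss.
import torch

def weighted_score(term_weights, doc_term_freqs):
    # doc_term_freqs[i] = frequency of query term i in the document
    return (term_weights * doc_term_freqs).sum()

def pairwise_hinge_loss(term_weights, rel_doc, irrel_doc, margin=1.0):
    s_pos = weighted_score(term_weights, rel_doc)
    s_neg = weighted_score(term_weights, irrel_doc)
    return torch.clamp(margin - (s_pos - s_neg), min=0.0)

# one toy query with 3 terms; frequencies in a relevant and an irrelevant document
weights = torch.ones(3, requires_grad=True)
relevant = torch.tensor([2.0, 0.0, 1.0])
irrelevant = torch.tensor([0.0, 3.0, 0.0])

optimizer = torch.optim.SGD([weights], lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    loss = pairwise_hinge_loss(weights, relevant, irrelevant)
    loss.backward()
    optimizer.step()
print(weights.detach())   # terms occurring in the relevant document get larger weights
```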
Anıl Özberk
Offensive Language Detection on Turkish Tweets with
BERT Models
M.S. Thesis, January 2022
ABSTRACT: As insulting statements increase on online platforms, these negative statements create reactions and disturb the peace of society. Identifying such expressions as early as possible is important to protect the victims. Offensive language detection research has been increasing in recent years. The Offensive Language Identification Dataset (OLID) was introduced to facilitate research on this topic. Examples in OLID were retrieved from Twitter and annotated manually. The Offensive Language Identification Task comprises three subtasks. In Subtask A, the goal is to classify the data as offensive or non-offensive. Data is offensive if it contains insults, threats, or profanity. Datasets in five languages, including Turkish, were offered for this task. The other two subtasks focus on the categorization of offense types (Subtask B) and targets (Subtask C). The last two subtasks mainly focus on English.
This study explores the effects of
the usage of Bidirectional Encoder Representations from Transformers (BERT)
models and fine-tuning techniques on offensive language detection on Turkish
tweets. The BERT models we use are pre-trained in Turkish. We design the
fine-tuning methods by considering the Turkish language and Twitter data. We
emphasize the importance of the pre-trained BERT model for the performance of a
downstream task. In addition, we also conduct experiments with classical
models, such as logistic regression, decision tree, random forest, and SVM.
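A compact fine-tuning sketch for the setup described above is shown below; the checkpoint name, toy tweets, labels, and hyperparameters are placeholders, not the thesis configuration. A pre-trained Turkish BERT model is fine-tuned for binary offensive/non-offensive classification.

```python
# Fine-tuning a Turkish BERT checkpoint for binary tweet classification (sketch).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dbmdz/bert-base-turkish-cased"   # a publicly available Turkish BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

tweets = ["örnek bir tweet", "başka bir tweet"]   # placeholder data
labels = torch.tensor([0, 1])                     # 0 = non-offensive, 1 = offensive
batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                # a few illustrative steps
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```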
Samira Karimi Mansoub
Selective Personalization Using Topical User Profile
to Improve Search Results
Ph.D.
Thesis, October 2020
ABSTRACT: Personalization is a technique used in Web search engines to improve
the effectiveness of information retrieval systems. There has recently been a
lot of research and application work in the field of personalized web search.
In this research, we first evaluate the effect of personalization for queries
with different characteristics. With this analysis, we investigate whether
personalization should be applied to all queries in the same way. While
personalizing some queries yields significant improvements in user experience
by providing a ranking in line with the user's preferences, it fails to improve
or even degrades the effectiveness for less ambiguous queries. A
potential-for-personalization metric can improve search engines by selectively
applying personalization.
Current methods for estimating the potential for personalization, such as click
entropy and topic entropy, are based on the documents clicked for a query or on
the query history. They have limitations, such as the unavailability of prior
click data for new and unseen queries or for queries without a history.
In this thesis, the topic entropy measure is improved by integrating the user
distribution into the metric, making it robust to the sparsity problem. This
metric estimates the potential for personalization using a topical user profile
created from user documents. In this way, we can overcome the cold start
problem when estimating the potential for new queries and increase the accuracy
of estimates for queries with a history. Since there are no previously clicked
documents for queries without a history, for such unseen queries we use the
topic distribution over the documents previously clicked by a user, i.e., the
topical user profile, instead of the documents clicked for the query.
Although the main focus of this thesis is on topic-based user profiles, since
there is little research on keyphrase-based user profiles in the
personalization process, we also carry out a comparison between keyphrase-based
and topic-based profiles. We examine how personalization can be integrated into
state-of-the-art keyphrase extraction models by considering different
supervised and unsupervised methods. We evaluate topic-based and
keyphrase-based user profiles using a re-ranking algorithm to complete the
personalization process on different datasets. In personalization with
keyphrase-based profiles, personalized models based on supervised keyphrase
extraction approaches obtained 7% higher accuracy than unsupervised approaches,
but they did not improve over topic-based models.
In topic-based models, we use a combination of personalization at the level of
user-specific and group profiling as part of the ranking process. In previous
ranking methods, the largest improvements in ranking are obtained for queries
that match the user's history. To extend this advantage to all queries, we
present a group personalized topical model (GPTM) that uses groups obtained by
clustering similar users according to their topical profiles. Experiments
reveal that the proposed potential prediction method correlates with human
query ambiguity judgments and that the group-profile-based ranking method
improves the Mean Reciprocal Rank by 8%.
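The sketch below illustrates the entropy-style measures contrasted in this abstract; the counts are invented. Click entropy is computed over the documents clicked for a query, while a topic entropy can be computed over topical user profiles, which remains available even when a query has no click history.

```python
# Click entropy versus topic entropy as "potential for personalization" signals (toy data).
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# click entropy: how spread out the clicks for one query are over documents
clicks_per_doc = Counter({"doc1": 40, "doc2": 35, "doc3": 25})
click_entropy = entropy(list(clicks_per_doc.values()))

# topic entropy: how spread out the query's users are over topical profiles
users_per_topic = Counter({"sports": 10, "politics": 9, "finance": 8})
topic_entropy = entropy(list(users_per_topic.values()))

# higher entropy suggests a more ambiguous query, hence more potential benefit
# from personalization
print(round(click_entropy, 3), round(topic_entropy, 3))
```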
Sinan Polat
Example-Based Machine Translation and Translation Memory System
M.S. Thesis, November 2016
ABSTRACT: With each passing day, communication between two people who do not
speak the same language becomes more important. In this regard, the importance
of automatic translation systems has been increasing, too. In this thesis, we
developed an example-based machine translation and translation memory system
that can translate between Turkish and English.
Example-Based Machine Translation (EBMT) is a translation technique that leans
on the machine learning paradigm. Basically, EBMT is a corpus-based approach
that utilizes the translation-by-analogy concept. In this sense, according to
our approach, the
translation templates between two languages are inferred from similarities and
differences of the given translation examples by using machine learning
techniques.
These
inferred translation templates are used in the translation of other texts. The similarities and the differences between
English sentences of two translation examples must correspond to the
similarities and the differences between Turkish sentences of those translation
examples. By using this information, the
translation templates are inferred from the given translation examples.
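The sketch below gives a rough illustration of the template-inference idea described above, for a simplified single-variable case with invented example pairs; the actual learning algorithm in the thesis is more general. The parts shared by two translation examples become the template, and the differing parts become a variable whose fillers are remembered as word-level correspondences.

```python
# Inducing a translation template from two example pairs (illustrative sketch).
def split_common(a, b):
    """Return (prefix, a_diff, b_diff, suffix) for token lists a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    return a[:i], a[i:len(a) - j], b[i:len(b) - j], a[len(a) - j:]

def induce_template(src1, src2, tgt1, tgt2):
    sp, sd1, sd2, ss = split_common(src1.split(), src2.split())
    tp, td1, td2, ts = split_common(tgt1.split(), tgt2.split())
    template = (" ".join(sp + ["X"] + ss), " ".join(tp + ["X"] + ts))
    fillers = {" ".join(sd1): " ".join(td1), " ".join(sd2): " ".join(td2)}
    return template, fillers

template, fillers = induce_template("I read a book", "I read a letter",
                                    "ben bir kitap okudum", "ben bir mektup okudum")
print(template)   # ('I read a X', 'ben bir X okudum')
print(fillers)    # {'book': 'kitap', 'letter': 'mektup'}
```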
In addition, helper programs known as translation memory systems are used
during translation. A translation memory is a storage environment that keeps
previously translated sentences or phrases. When translating with a translation
memory system, the system retrieves the translation examples most similar to
the sentence that the translator wants to translate. With the help of the
retrieved translation examples, the translation of a new sentence can be
achieved more quickly.
In this thesis, in addition to developing a complete example-based machine
translation system, we aimed to develop a translation memory system and to
merge these two systems. We also give importance to the scalability of the
systems with respect to the size of the datasets they use.
Merve Selçuk Şimşek
Example Based Machine Translation between Turkish and Turkish Sign
Language
M.S. Thesis, July, 2016
ABSTRACT: Communication is one of the primary necessities for humankind to live
and survive. There have been many different ways to communicate for centuries,
yet there are mainly three in today's world: spoken, written, and sign
languages.
According to research on the language usage of deaf people, they commonly
prefer sign language over the other ways. They differ in many aspects from
people who use a spoken language. We live in a world in which being different
may be hardly welcomed, and, additionally, they are a minority. Most of the
time they need helpers and/or interpreters in daily life and are accompanied by
human helpers. We intend to build a machine translation system between Turkish
and Turkish Sign Language (TSL) in the belief that one day this novel work on
Turkish will help these people to live independently.
We prefer an example-based machine translation (EBMT) approach, which is a
corpus-based approach, and intend to create a bidirectional dynamic machine
translation system between spoken and sign languages. We believe this approach
makes our study a unique work for Turkish. For TSL notation we use glosses, and
our study covers a bidirectional machine translation system between written
Turkish and TSL glosses.
Gulshat Kessikbayeva
Example Based Machine Translation System between Kazakh and Turkish
Supported by Statistical Language Model
Ph.D. Thesis, May, 2016
ABSTRACT: Example-Based Machine Translation (EBMT) is an analogy-based type of
Machine Translation (MT), where translation is made according to an aligned
bilingual corpus. Moreover, there are many different methodologies in MT, and
hybridization between these methods is also possible; it focuses on combining
the strongest sides of more than one MT approach to provide better translation
quality. Hybrid Machine Translation (HMT) has two parts: the guided part and
the information part.
Our work is guided by EBMT, and a hybrid example-based machine translation
system between the Kazakh and Turkish languages is presented here. Analyzing
both languages at the morphological level and then constructing morphological
processors is one of the most important parts of the system. These
morphological processors are used to obtain the lexical forms of
the surface level words and the surface level forms of translation results at
lexical level. Translation templates are kept at lexical level and they
translate a given source language sentence at lexical level to a target
language sentence at lexical level. Our bilingual corpora hold translation
examples at surface level and their words are morphologically analyzed by
appropriate morphological analyzer before they are fed into the learning
module. Thus, translation templates are learned at morphological level from a
bilingual parallel corpus between Turkish and Kazakh. Translations can be
performed at both directions using these learned translation templates.
The system is supported by a statistical language
model for the target language. Therefore, translation results are sorted
according to both their confidence factors that are computed using the
confidence factors of the translation templates used in those translations and
statistical language model probabilities of those translation results. Thus,
the statistical language model of the target language is used in the ordering
of translation results, in addition to translation template confidence factors,
in order to obtain more precise translation results.
Our main aim with our hybrid example-based machine translation system is to
obtain more accurate translation results by using knowledge gained in advance
from target-language resources. One of the reasons we propose this hybrid
approach is that monolingual language resources are more widely available than
bilingual language resources. In this thesis, experiments show that we can rely
on the combination of the EBMT and SMT approaches, because it produces
satisfying results.
Nicat Süleymanov
Developing A Platform That Supplies Processed Information from Internet
Resources and Services
M.S. Thesis, September, 2015
ABSTRACT: Ever-increasing information resources make it harder to reach the
needed piece of information; users do not want 10 billion results from search
engines, but prefer the 10 best-matched answers, and, if it exists, they prefer
the single right answer. In this research, we present a
Turkish question answering system that extracts the most suitable answer from
internet services and resources. During question analysis, the question class
is determined, certain expressions are predicted from the lexical and
morphological properties of the words in the question, and our two-stage
solution approach tries to get the answer. Furthermore, to increase the success
rate of the system, the WordNet platform is used.
In the information retrieval process, the system works over documents using
semantic web information instead of documents retrieved by a classic search
engine. In order to reach the needed information easily among ever-increasing
resources, Tim Berners-Lee's idea of the semantic web is used in this research.
DBpedia extracts structured information from Wikipedia articles and makes this
structured information accessible on the web. In our research, the
subject-predicate-object triples matched with the asked question are formulated
to get the answer in Turkish; the Wikipedia Search API and the Bing Translate
API are used for searching and obtaining the Turkish equivalent of the
information.
Farhad Soleimanian Gharehchopogh
Open Domain Factoid Question Answering System
Ph.D. Thesis, September, 2015
ABSTRACT: Question Answering (QA) is a field of Artificial Intelligence (AI),
Information Retrieval (IR), and Natural Language Processing (NLP), and it leads
to systems that automatically answer natural language questions in open and
closed domains. Question Answering Systems (QASs) have to deal with different
types of user questions. While answers for some simple questions can be short
phrases, answers for some more complex questions can be short texts. A question
with a single correct answer is known as a factoid question, and a question
answering system that deals with factoid questions is called a factoid QAS.
In this thesis, we present a factoid QAS that consists of three phases:
question processing, document/passage retrieval, and answer processing. In the
question processing phase, we consider a new two-level category structure and
use machine learning techniques to generate search engine queries from user
questions. Our factoid QAS uses the World Wide Web (WWW) as its corpus of texts
and knowledge base in the document/passage retrieval phase. Also, it is a
pattern-based QAS that uses an answer pattern matching technique in the answer
processing phase.
We also present a classification of existing QASs. The classification contains
early QASs, rule-based QASs, pattern-based QASs, NLP-based QASs, and machine
learning based QASs. Our factoid QAS uses a two-level category structure that
includes 17 coarse-grained and 57 fine-grained categories, and it utilizes this
category structure in order to extract answers to questions. The training
dataset consists of 570 questions originating from the TREC-8 and TREC-9
questions, and 570 other questions together with the TREC-8, TREC-9, and
TREC-10 questions are used as testing datasets.
In our QAS, the query expansion step is very important and it affects the
overall performance of our QAS. When an original user question is given as a
query, the number of retrieved relevant documents may not be enough. We present
an automatic query expansion approach based on query templates and question
types. New queries are generated from query templates of question categories
and the category of a user question is found by a Naïve Bayes classification
algorithm. New expanded queries are generated by filling gaps in query
templates with two appropriate phrases. The first phrase is the question type
phrase and it is found directly by the classification algorithm. The second phrase
is the question phrase and it is detected from possible question templates by a
Levenshtein distance algorithm.
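The sketch below is a toy version of this query-expansion step; the templates, categories, and example question are invented, and the Naive Bayes category classifier of the thesis is replaced here by a pure edit-distance pick, so it only illustrates the template-filling and Levenshtein-matching parts.

```python
# Toy template-based query expansion with a Levenshtein-distance template match.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# hypothetical canonical question templates and query templates per category
question_templates = {"capital": "what is the capital of X",
                      "birth_date": "when was X born"}
query_templates = {"capital": ["capital of {q}", "{q} capital city"],
                   "birth_date": ["{q} was born in", "{q} date of birth"]}

def expand(question: str, question_phrase: str) -> list:
    category = min(question_templates,
                   key=lambda c: levenshtein(question, question_templates[c]))
    return [t.format(q=question_phrase) for t in query_templates[category]]

print(expand("what is the capital of Turkey", "Turkey"))
# ['capital of Turkey', 'Turkey capital city']
```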
Query templates for question types are created by analyzing possible questions
of those question types. We evaluated our query expansion approach with the
two-level category structure and the factoid question types included in the
TREC-8, TREC-9, and TREC-10 conference datasets. The results of our automatic
query expansion approach outperform those of the manual query expansion
approach.
After automatically learning answer patterns by querying the web, we use answer
pattern sets for each question type. Answer patterns extract answers from
retrieved related text segments, and answer patterns can be generalized with
Named Entity Recognition (NER). NER is a sub-task of Information Extraction
(IE) used in the answer processing phase; it classifies terms in textual
documents into predefined categories of interest such as location names, person
names, and event dates. The ranking of answers is based on frequency counting
and the Confidence Factor (CF) values of the answer patterns.
The results of the system show that our approach is effective for question
answering: it achieves a Mean Reciprocal Rank (MRR) of 0.58 for the
fine-grained category structure on our corpus, 0.62 MRR for the coarse-grained
category structure, and 0.55 MRR in the evaluation on the TREC-10 testing
dataset. The results of the system have been compared with those of other QASs
using standard measurements on the TREC datasets.
Servet Taşcı
Content Based Media Tracking and News Recommendation System
M.S. Thesis, June, 2015
ABSTRACT: With the increasing use of the Internet in our lives, the amount of
unstructured data, and particularly the amount of textual data, has increased
dramatically. Since the access point of users to this data is the Internet, the
reliability and accuracy of these resources stand out as a concern. Besides the
multitude of resources, most resources have similar content, and it is quite
challenging to read only the needed news among these resources in a short time.
It is also necessary that the accessed resource really includes the required
information and that it is confirmed by the user. Recommender systems assess
different characteristics of users, correlate the accessed content with the
user, evaluate the content according to specific criteria, and then recommend
it to the user. The first recommender systems used simple content filtering
features, but current systems use much more complicated calculations and
algorithms and try to correlate many characteristics of the users and the data.
These improvements have allowed recommender systems to be used as decision
support systems.
This thesis aims at collecting data from textual news resources, classifying
and summarizing the data, and recommending the news by correlating it with the
characteristics of users. Recommender systems mainly use three methods:
content-based filtering, collaborative filtering, and hybrid filtering. In our
system, content-based filtering is used.
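The sketch below illustrates content-based filtering in its simplest form; the articles and the user's reading history are invented, and the thesis system is considerably richer. Articles are represented as TF-IDF vectors, the user profile is the mean vector of previously read articles, and the most similar unread article is recommended.

```python
# Minimal content-based news recommendation with TF-IDF vectors (toy data).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "central bank raises interest rates to fight inflation",
    "the national football team wins the championship final",
    "new inflation figures surprise financial markets",
    "star striker transfers to a rival football club",
]
read_by_user = [0]   # the user has read the first article

doc_vectors = TfidfVectorizer().fit_transform(articles).toarray()
profile = doc_vectors[read_by_user].mean(axis=0, keepdims=True)   # user profile vector
scores = cosine_similarity(profile, doc_vectors)[0]

unread = [i for i in range(len(articles)) if i not in read_by_user]
recommended = max(unread, key=lambda i: scores[i])
print(articles[recommended])   # the economics article, closest to the user's history
```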
Gönenç Ercan
Lexical Cohesion Analysis for Topic Segmentation, Summarization and Keyphrase Extraction
Ph.D. Thesis, December, 2012
ABSTRACT: When we express an idea or a story, it is inevitable that we use
words that are semantically related to each other. When this phenomenon is
exploited from the aspect of the words in the language, it is possible to infer
the level of semantic relationship between words by observing their
distribution and use in discourse. From the aspect of discourse, it is
possible to model the structure of the document by observing the changes in the
lexical cohesion in order to attack high level natural language processing
tasks. In this research lexical cohesion is investigated from both of these
aspects by first building methods for measuring semantic relatedness of word pairs
and then using these methods in the tasks of topic segmentation, summarization
and keyphrase extraction.
Measuring
semantic relatedness of words requires prior knowledge about the words. Two
different knowledge-bases are investigated in this research. The first
knowledge base is a manually built network of semantic relationships, while the
second relies on the distributional patterns in raw text corpora. In order to
discover which method is effective in lexical cohesion analysis, a
comprehensive comparison of state-of-the-art methods in semantic relatedness is
made.
For
topic segmentation different methods using some form of lexical cohesion are
present in the literature. While some of these confine the relationships only
to word repetition or strong semantic relationships like synonymy, no other
work uses the semantic relatedness measures that can be calculated for any two
word pairs in the vocabulary. Our experiments suggest that using semantic
relatedness improves topic segmentation performance over methods that use only
classical relationships and word repetition. Furthermore, the experiments
compare the performance of different semantic relatedness methods in a
high-level task. The detected topic segments are used in summarization and
achieve better results compared to a lexical-chain-based method that uses
WordNet.
Finally,
the use of lexical cohesion analysis in keyphrase
extraction is investigated. Previous research shows that keyphrases
are useful tools in document retrieval and navigation. While these point to a
relation between keyphrases and document retrieval
performance, no other work uses this relationship to identify keyphrases of a given document. We aim to establish a link
between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are
evaluated in keyphrase extraction using a Naive Bayes
classifier. Our experiments indicate that these features improve the
effectiveness of keyphrase extraction in documents of different lengths. More
importantly, the commonly used features of frequency and first position in the
text perform poorly on shorter documents, whereas QPP features are more robust
and achieve better results.
Serhan Tatar
Automating Information Extraction Task For Turkish Texts
Ph.D. Thesis, January, 2011
ABSTRACT: Throughout history, mankind has often
suffered from a lack of necessary resources. In today's information world, the
challenge can sometimes be a wealth of resources. That is to say, an excessive
amount of information implies the need to find and extract necessary
information. Information extraction can be defined as the identification of selected types of entities,
relations, facts or events in a set of unstructured text documents in a natural
language.
The
goal of our research is to build a system that automatically locates and
extracts information from Turkish unstructured texts. Our study focuses on two
basic IE tasks: Named Entity Recognition and Entity Relation Detection. Named
Entity Recognition, finding named entities (persons, locations, organizations,
etc.) located in unstructured texts, is one of the most fundamental IE tasks.
Entity Relation Detection task tries to identify relationships between entities
mentioned in text documents.
Using a supervised learning strategy, the developed system starts with a set of
examples collected from a training dataset and generates the extraction rules
from the given examples by using a carefully designed coverage algorithm.
Moreover, several rule filtering and rule refinement techniques are utilized to
maximize generalization and accuracy at the same time. In order to obtain accurate
generalization, we use several syntactic and semantic features of the text,
including orthographic, contextual, lexical, and morphological features. In
particular, morphological features of the text are effectively used in this
study to increase the extraction performance for Turkish, an agglutinative
language. Since the system does not rely on handcrafted rules/patterns, it does
not heavily suffer from the domain adaptability problem.
The results
of the conducted experiments show that (1) the developed systems are
successfully applicable to the Named Entity Recognition and Entity Relation
Detection tasks, and (2) exploiting morphological features can significantly
improve the performance of information extraction from Turkish, an
agglutinative language.
Ergin Soysal
Ontology Based Information Extraction on Free Text Radiological Reports
Using Natural Language Processing Approach
Ph.D. Thesis, September, 2010 (co-supervised by Nazife
Baykal)
ABSTRACT: This thesis describes an information extraction system
that is designed to process free text Turkish radiology reports in order to
extract and convert the available information into a structured information
model. The system uses natural language processing techniques together with
domain ontology in order to transform the verbal descriptions into a target
information model, so that they can be used for computational purposes. The
developed domain ontology is effectively used in entity recognition and
relation extraction phases of the information extraction task. The ontology
provides the flexibility in the design of extraction rules, and the structure
of the ontology also determines the information model that describes the
structure of the extracted semantic information. In addition, some of the
missing terms in the sentences are identified with the help of the ontology.
One of the main contributions of this thesis is the usage of ontology in
information extraction that increases the expressive power of extraction rules
and helps to determine missing items in the sentences. The system is the first
information extraction system for Turkish texts. Since Turkish is a
morphologically rich language, the system uses a morphological analyzer and the
extraction rules are also based on the morphological features. TRIES achieved
93% recall and 98% precision results in the performance evaluations.
Mücahid Kutlu
Noun Phrase Chunker For Turkish Using
Dependency Parser
M.S. Thesis, July, 2010
ABSTRACT: Noun phrase chunking is a sub-category of shallow
parsing that can be used for many natural language processing tasks. In this
thesis, we propose a noun phrase chunker system for
Turkish texts. We use a weighted constraint dependency parser to represent the
relationship between sentence components and to determine noun phrases.
The
dependency parser uses a set of hand-crafted rules which can combine
morphological and semantic information for constraints. The rules are suitable
for handling complex noun phrase structures because of their flexibility. The
developed dependency parser can be easily used for shallow parsing of all
phrase types by changing the employed rule set.
The
lack of reliable human tagged datasets is a significant problem for natural
language studies about Turkish. Therefore, we constructed the first noun phrase
dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset.
The
correct morphological disambiguation of words is required for the correctness
of the dependency parser. Therefore, in this thesis, we propose a hybrid
morphological disambiguation technique which combines statistical information,
hand-crafted grammar rules, and transformation based learning rules. We have also
constructed a dataset for testing the performance of our disambiguation system.
According to tests, the disambiguation system is highly effective.
Filiz Alaca Aygül
Natural Language Query Processing In Ontology
Based Multimedia Databases
M.S. Thesis, April, 2010 (co-supervised by Nihan
Cicekli)
ABSTRACT: In this thesis a natural language query interface is
developed for semantic and spatio-temporal querying
of MPEG-7 based domain ontologies. The underlying ontology is created by
attaching domain ontologies to the core Rhizomik
MPEG-7 ontology. The user can pose concept, complex concept (objects connected
with an AND or OR connector), spatial (left, right, ...), temporal (before,
after, at least 10 minutes before, 5 minutes after, ...), object trajectory,
and directional trajectory (east, west, southeast, ..., left, right, upwards,
...) queries to the system. Furthermore, the system handles
the negative meaning in the user input. When the user enters a natural language
(NL) input, it is parsed with the link parser. According to query type, the
objects, attributes, spatial relation, temporal relation, trajectory relation,
time filter and time information are extracted from the parser output by using
predefined rules. After the information extraction, SPARQL queries are
generated, and executed against the ontology by using an RDF API. Results are
retrieved and they are used to calculate spatial, temporal, and trajectory
relations between objects. The results satisfying the required relations are
displayed in a tabular format and user can navigate through the multimedia
content.
Kezban Demirtaş
Automatic Video Categorization And Summarization
M.S. Thesis, September, 2009 (co-supervised by Nihan
Cicekli)
ABSTRACT: Today people have access to a large amount of video, and finding a
video of interest has become a difficult and time-consuming job. It can be
infeasible for a human to go through all available videos to find the video of
interest. Automatically categorizing videos and presenting a semantic summary
of a video could provide people a significant advantage in this regard. In this
thesis, we perform automatic video categorization and summarization by using
the subtitles of videos.
We propose two methods for video categorization. The first method performs
unsupervised categorization by applying natural language processing techniques
to video subtitles and uses the WordNet lexical database and the WordNet
domains. The method starts with text preprocessing. Then a keyword extraction
algorithm and a word sense disambiguation method are applied. The WordNet
domains that correspond to the correct senses of the keywords are extracted.
The video is assigned a category label based on the extracted domains. The
second method has the same steps for extracting the WordNet domains of a video
but performs categorization by using a learning module. Experiments with
documentary videos give promising results in
discovering the correct categories of videos.
Video
summarization algorithms present condensed versions of a full-length video by
identifying the most significant parts of the video. We propose a video
summarization method using the subtitles of videos and text summarization
techniques. We identify the significant sentences of the subtitles of a video
by using text summarization techniques, and then we compose a video summary by
finding the video parts corresponding to these summary sentences.
Nagehan Pala Er
Turkish Factoid Question Answering Using Answer Pattern Matching
M.S. Thesis, July, 2009
ABSTRACT: Efficiently locating information on the Web has become
one of the most important challenges in the last decade. The Web Search Engines
have been used to locate the documents containing the required information.
However, in many situations a user wants a particular piece of information
rather than a document set. Question Answering (QA) systems have addressed this
problem, and they return explicit answers to questions rather than a set of
documents. Questions addressed by QA systems can be categorized into five
categories: factoid, list, definition, complex, and speculative questions. A
factoid question has exactly one correct answer, and the answer is mostly a
named entity like person, date, or location. In this thesis, we develop a
pattern matching approach for a Turkish Factoid QA system. In the TREC-10 QA
track, most of the question answering systems used sophisticated linguistic
tools. However, the best performing system at the track used only an extensive
list of surface patterns; therefore, we decided to investigate the potential of
the answer pattern matching approach for our Turkish Factoid QA system. We try different
methods for answer pattern extraction such as stemming and named entity
tagging. We also investigate query expansion by using answer patterns. Several
experiments have been performed to evaluate the performance of the system.
Compared with the results of the other factoid QA systems, our methods have
achieved good results. The results of the experiments show that named entity
tagging improves the performance of the system.
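A toy illustration of answer pattern matching for a factoid question is shown below; the patterns, question type, and text snippets are invented, whereas the thesis learns its patterns automatically from the web and targets Turkish questions.

```python
# Extracting factoid answers with surface answer patterns (illustrative sketch).
import re

# hypothetical learned patterns for the question type "birth year of <PERSON>"
patterns = [
    r"{person} was born in (\d{{4}})",
    r"{person} \((\d{{4}})",
]

def extract_answers(person, snippets):
    answers = []
    for snippet in snippets:
        for p in patterns:
            match = re.search(p.format(person=re.escape(person)), snippet)
            if match:
                answers.append(match.group(1))
    return answers

snippets = ["Ataturk (1881-1938) founded the republic.",
            "Mustafa Kemal Ataturk was born in 1881 in Salonika."]
print(extract_answers("Ataturk", snippets))   # ['1881', '1881']
# frequency counting over such candidates is then used to rank the final answer
```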
Özkan Göktürk
Metadata Extraction From Text In Soccer Domain
M.S. Thesis, September, 2008 (co-supervised by Nihan
Cicekli)
ABSTRACT: Video databases and content based retrieval in these
databases have become popular with the improvements in technology. Metadata
extraction techniques are used for providing data to video content. One popular
metadata extraction technique for multimedia is information extraction from
text. For some domains, such as the soccer, movie, and news domains, it is
possible to find text accompanying the video. In this thesis, we present an
approach for metadata extraction from match reports in the soccer domain. The
UEFA Cup and UEFA Champions League match reports are downloaded
from the web site of UEFA by a web-crawler. These match reports are
preprocessed by using regular expressions and then important events are
extracted by using hand-written rules. In addition to hand-written rules, two
different machine learning techniques are applied on match corpus to learn
event patterns and automatically extract match events. Extracted events are
saved in an MPEG-7 file. A user interface is implemented to query the events in
the MPEG-7 match corpus and view the corresponding video segments.
Turhan Osman Daybelge
Improving The Precision Of Example-Based Machine
Translation By Learning From User Feedback
M.S. Thesis, September 2007
ABSTRACT: Example-Based Machine Translation (EBMT) is a corpus
based approach to Machine Translation (MT) that utilizes the translation by
analogy concept. In our EBMT system, translation templates are extracted
automatically from bilingual aligned corpora, by substituting the similarities
and differences in pairs of translation examples with variables. As this
process is done on the lexical-level forms of the translation examples, and
words in natural language texts are often morphologically ambiguous, a need for
morphological disambiguation arises. Therefore, we present here a rule-based
morphological disambiguator for Turkish. In earlier
versions of the discussed system, the translation results were solely ranked
using confidence factors of the translation templates. In this study, however,
we introduce an improved ranking mechanism that dynamically learns from user
feedback. When a user, such as a professional human translator, submits his
evaluation of the generated translation results, the system learns
context-dependent co-occurrence rules from this feedback. The newly learned
rules are later consulted, while ranking the results of the following
translations. Through successive translation-evaluation cycles, we expect that
the output of the ranking mechanism complies better with user expectations,
listing the more preferred results in higher ranks. The evaluation of our
ranking method, using the precision value at top 1, 3 and 5 results and the
BLEU metric, is also presented.
Hande Doğan
Example Based Machine Translation with Type Associated Translation
Examples
M.S. Thesis, January 2007
ABSTRACT: Example-based machine translation is a translation technique that
leans on the machine learning paradigm. This technique has been modeled on the
following learning process: a person is given short and simple sentences in
language A with their correspondences in language B; the person memorizes these
pairs and then becomes able to translate new sentences via these pairs in
memory. In our system, the translation pairs are kept as translation templates.
A translation template is induced from two given translation examples by
replacing differing parts in these examples by variables. A variable replacing
a difference that consists of two differing parts (one from the first example,
and the other one from the second example) is a generalization of those two
differing parts, and these variables are supported with part-of-speech tag
information in order to reduce incorrect translations. After the learning
phase, translation is achieved by finding the appropriate template(s) and
replacing the variables. ( pdf
copy )
Yasin Uzun
Induction Of Logical Relations Based On Specific
Generalization Of Strings
M.S. Thesis, January 2007
ABSTRACT: Learning logical relations from examples expressed as first-order
facts has been studied extensively in Inductive Logic Programming research.
Learning with positive-only data may cause overgeneralization of examples,
leading to inconsistent resulting hypotheses. A learning heuristic that infers
the specific generalization of strings based on unique match sequences is shown
to be capable of learning predicates with string arguments. This thesis
outlines the effort to build an inductive learner based on the idea of the
specific generalization of strings that generalizes given clauses, considering
the background knowledge, using a least general generalization schema. The
system is also extended to generalize predicates having numeric arguments and
is shown to be capable of learning concepts such as family relations, grammar
learning, and predicting mutagenicity using numeric data. ( pdf
copy )
Gönenç Ercan
Automated Text Summarization And Keyphrase Extraction
M.S.
Thesis, September 2006
ABSTRACT: As the number of electronic documents increases rapidly, the need for
faster techniques to assess the relevance of documents emerges. A summary can
be considered a concise representation of the underlying text. To form an ideal
summary, a full
understanding of the document is essential. For computers, full understanding
is difficult, if not impossible. Thus, selecting important sentences from the
original text and presenting these sentences as a summary is a common technique
in automated text summarization research.
The lexical cohesion structure of the text can be
exploited to determine the importance of a sentence/phrase. Lexical chains are
useful tools to analyze the lexical cohesion structure in a text. This thesis
discusses our research on automated text summarization and keyphrase
extraction using lexical chains. We investigate the effect of the use of
lexical cohesion features in keyphrase extraction,
with a supervised machine learning algorithm. Our summarization algorithm
constructs the lexical chains, detects topics roughly from lexical chains,
segments the text with respect to the topics and selects the most important
sentences. Our experiments show that lexical cohesion based features improve keyphrase extraction. Our summarization algorithm has
achieved good results, compared to some other lexical cohesion based
algorithms. ( pdf
copy )
Özlem İstek
A Link Grammar For Turkish
M.S.
Thesis, August 2006
ABSTRACT: Syntactic
parsing, or syntactic analysis, is the process of analyzing an input sequence
in order to determine its grammatical structure, i.e. the formal relationships
between the words of a sentence, with respect to a given grammar. In this
thesis, we developed the grammar of the Turkish language in the link grammar
formalism. In the grammar, we used the output of a fully described
morphological analyzer, which is very important for agglutinative languages
like Turkish. The grammar that we developed is lexical in the sense that we
used the lexemes of only some function words, and for the rest of the word
classes we used morphological feature structures. In addition, we preserved
some of the syntactic roles of the intermediate derived forms of words in our
system. ( pdf
copy )
Barış Eker
Turkish Text to Speech System
M.S.
Thesis, April 2002
ABSTRACT: Scientists have been interested in producing
human speech artificially for more than two centuries. After the invention of
computers, computers were used to synthesize speech. With the help of this new
technology, Text-To-Speech (TTS) systems, which take a text as input and
produce speech as output, started to be created. Some languages like English
and French have received most of the attention, while languages like Turkish
have not been taken into consideration.
This thesis presents a TTS system for Turkish that uses the diphone
concatenation method. It takes a text as input and produces the corresponding
speech in Turkish. The output can be obtained in only one male voice in this
system. Since Turkish is a phonetic language, this system can also be used for
other phonetic languages with some minor modifications. If this system is
integrated with a pronunciation unit, it can
also be used for languages that are not phonetic. ( pdf
copy )
Göker Canıtezer
Generalization of Predicates with String Arguments
M.S.
Thesis, January 2002
ABSTRACT: String/sequence generalization is
used in many different areas such as machine learning, example-based machine
translation and DNA sequence alignment. In this thesis, a method is proposed to
find the generalizations of the predicates with string arguments from the given
examples. Trying to learn from examples is a very hard problem in machine
learning, since finding the globally optimal point at which to stop
generalization is a difficult and time-consuming process. All the work done
until now employs a heuristic to find the best solution, and this work is one
of them. In this study, some restrictions applied by the SLGG (Specific Least
General Generalization) algorithm, which was developed to be used in an
example-based machine translation system, are relaxed to find all possible
alignments of two strings. Moreover, a Euclidean-distance-like scoring
mechanism is used to
find the most specific generalizations. Some of the generated templates are
eliminated by four different selection/filtering approaches to get a good
solution set. Finally, the result set is presented as a decision list, which
provides the handling of exceptional cases. ( pdf
copy )
Kemal Altıntaş
Turkish to Crimean Tatar Machine Translation System
M.S.
Thesis, July 2001
ABSTRACT: Machine translation has always been interesting to people
since the invention of computers. Most of the research has been conducted on
western languages such as English and French, and Turkish and Turkic languages
have been left out of the scene. Machine translation between closely related
languages is easier than between language pairs that are not related with each
other. Having many parts of their grammars and vocabularies in common reduces
the amount of effort needed to develop a translation system between related
languages. A translation system that makes a morphological analysis supported
by simpler translation rules and context dependent bilingual dictionaries would
suffice most of the time. Usually a semantic analysis may not be needed.
This thesis presents a machine translation system from
Turkish to Crimean Tatar that uses finite state techniques for the translation
process. By developing a machine translation system between Turkish and Crimean
Tatar, we propose a sample model for translation between close pairs of
languages. The system we developed takes a Turkish sentence, analyses all the
words morphologically, translates the grammatical and context dependent
structures, translates the root words and finally morphologically generates the
Crimean Tatar text. Most of the time, at least one of the outputs is a true
translation of the input sentence. ( pdf
copy )
Atacan Çundoroğlu
Error Tolerant Finite State Parsing for a Turkish Dialogue System
M.S.
Thesis, July 2001
ABSTRACT: In NLP (Natural Language Processing), high level
grammar formalisms are frequently employed for parsing. Since in practice no
formalism can cope with the diversity and the flexibility of the human
languages, such formalisms are used in closed domains, with sub-languages. Even
though we believe that in an open world sophisticated analysis is required for
extracting meaning from natural language texts, this does not have to be the
case for the closed domains. Simpler time-efficient finite state methods can be
used in closed domains. With their simplicity and time-efficiency, finite state
methods are not only responsive, but also easy to augment with error tolerance
which allows these methods to flexibly parse mildly ungrammatical sentences. In
this thesis, we present a parser module which is based on error tolerant finite
state recognition and a grammar for parsing transcribed dialogue utterances in
a closed Turkish banking domain. Test results on synthetically created
erroneous sentences indicate that the proposed system can analyze
ungrammatical sentences efficiently and can scale with the growth of the
grammar. ( postscript
copy )
Umut Topkara
Prefix-Suffix Based Statistical Language Model for Turkish
M.S.
Thesis, July 2001
ABSTRACT: As a large amount of online text became available, concisely
representing quantitative
information about language and doing inference on this information for natural
language applications have become an attractive research area. Statistical
language models try to estimate the unknown probability distribution P(u) that
is assumed to have produced large text corpora of linguistic units u. This
probability distribution estimate is used to improve the performance of many
natural language processing applications including speech recognition (ASR),
optical character recognition (OCR), spelling and grammar correction, machine
translation and document classification. Statistical language modeling has been
successfully applied to English. However, this good performance of approaches
to statistical modeling of English does not apply to Turkish. Turkish has a
productive agglutinative morphology; that is, it is possible to derive
thousands of word forms from a given root word by adding suffixes. When
statistical modeling by word units is used, this productive vocabulary
structure causes data sparseness problems in general and serious space problems
in time- and memory-critical applications such as speech recognition.
According to a recent Ph.D.
thesis by Hakkani-Tur, using fixed-size prefix and suffix parts of words for
the statistical modeling of Turkish performs better than using whole words for
the task of selecting the most likely sequence of words from a list of
candidate words emitted by a speech recognizer. Following these successful
results, we have conducted further research on using smaller units for the
statistical modeling of Turkish. We have used a fixed number of syllables for
the prefix and suffix parts. In our experiments, we have used a small
vocabulary of prefixes and suffixes to test the robustness of our approach. We
also compared the performance of prefix-suffix language models having a 2-word
context with word 2-gram models. We have found a language model that uses
subword units and can perform as well as a large word-based language model in a
2-word context while still being half its size. ( postscript
copy )
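The sketch below illustrates the general idea of modeling Turkish with subword units instead of whole words; the 3-character prefix split, toy corpus, and add-one smoothing are assumptions for the example, not the thesis's syllable-based units or estimation method.

```python
# A toy bigram language model over prefix/suffix subword units.
from collections import Counter

def split_word(word, prefix_len=3):
    prefix = word[:prefix_len]
    suffix = word[prefix_len:] or "<nosuf>"
    return [prefix, suffix]

def units(sentence):
    out = []
    for w in sentence.split():
        out.extend(split_word(w))
    return out

corpus = ["evden geldim", "evdeki kitap", "okula gittim"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + units(sent)
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def bigram_prob(prev, cur, alpha=1.0):
    """Add-one smoothed P(cur | prev) over the subword vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)

print(bigram_prob("evd", "en"))   # probability of the suffix unit "en" after prefix "evd"
```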
Ayse Pınar Saygın
Turing Test and Conversation
M.S.
Thesis, July 1999
ABSTRACT: The Turing Test is one of
the most disputed topics in Artificial Intelligence, Philosophy of Mind and
Cognitive Science. It was proposed 50 years ago as a method to determine
whether machines can think or not. It embodies important philosophical issues,
as well as computational ones. Moreover, because of its characteristics, it
requires interdisciplinary attention. The Turing Test posits that, to be
granted intelligence, a computer should imitate human conversational behavior so
well that it should be indistinguishable from a real human being. From this, it
follows that conversation is a crucial concept in its study. Surprisingly,
focusing on conversation in relation to the Turing Test has not been a
prevailing approach in previous research. This thesis first provides a thorough
and deep review of the 50 years of the Turing Test. Philosophical arguments,
computational concerns, and repercussions in other disciplines are all
discussed. Furthermore, this thesis studies the Turing Test as a special kind
of conversation. In doing so, the relationship between existing theories of
conversation and human-computer communication is explored. In particular,
Grice's cooperative principle and conversational maxims are concentrated on.
Viewing the Turing Test as conversation and computers as language users has
significant effects on the way we look at Artificial Intelligence and on
communication in general. ( postscript
copy )
Zeynep Orhan
Confidence Factor Assignment to Translation Templates
M.S.
Thesis, September 1998
ABSTRACT: The TTL (Translation Template Learner) algorithm learns lexical-level
correspondences between two translation examples by using analogical reasoning.
The sentences used as
translation examples have similar and different parts in the source language
which must correspond to the similar and different parts in the target
language. Therefore, these correspondences are learned as translation
templates. The learned translation templates are used in the translation of
other sentences. However, we need to assign confidence factors to these
translation templates to order translation results with respect to previously
assigned confidence factors. This thesis proposes a method for assigning
confidence factors to translation templates learned by the TTL algorithm.
In this process, each template is assigned a confidence factor according to the
statistical information obtained from the training data. Furthermore, some
template combinations are also assigned confidence factors in order to
eliminate certain combinations that result in bad translations. ( pdf
copy )
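A toy sketch of a statistically derived confidence factor is shown below; the smoothing formula, counts, and templates are invented and are not the thesis's actual assignment method. The idea is simply that a template's confidence reflects how often it contributed to correct translations in the training data, and that templates can then be ranked by this value.

```python
# Assigning and ranking by a smoothed confidence factor (illustrative sketch).
def confidence_factor(correct_uses, total_uses, smoothing=1.0):
    """Laplace-smoothed ratio of correct uses to all uses of a template."""
    return (correct_uses + smoothing) / (total_uses + 2 * smoothing)

# hypothetical (correct_uses, total_uses) statistics for two learned templates
templates = {
    "X+ACC okudum <-> I read X": (18, 20),
    "X gitti <-> X went": (5, 9),
}
ranked = sorted(templates, key=lambda t: confidence_factor(*templates[t]), reverse=True)
print(ranked)   # templates ordered by confidence, highest first
```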
Selman Murat Temizsoy
Design and Implementation of a System for Mapping Text Meaning
Representations to F-Structures of Turkish Sentences
M.S.
Thesis, August 1997
ABSTRACT: The interlingua approach to Machine Translation (MT) aims to achieve
the translation task in
two independent steps. First, the meanings of source language sentences are
represented in a language-independent artificial language. Then, sentences of
the target language are generated from those meaning representations.
The generation task in this approach is performed in three major steps, among which
the second step creates the syntactic structure of a sentence from its meaning
representation and selects the words to be used in that sentence. This thesis
focuses on the design and the implementation of a prototype system that
performs this second task. The meaning representation used in this work
utilizes a hierarchical world representation, ontology,
to denote events and entities, and embeds semantic and pragmatic issues with
special frames. The developed system is language-independent and it takes
information about the target language from three knowledge resources: lexicon
(word knowledge), map-rules (the relation
between the meaning representation and the syntactic structure), and target
language's syntactic structure representation. It performs two major tasks in
processing the meaning representation: lexical selection and mapping the two
representations of a sentence. The implemented system is tested on Turkish
using small-sized knowledge resources developed for Turkish. The output of the
system can be fed as input to a tactical generator, which is developed for
Turkish, to produce the final Turkish sentences. ( pdf
copy )
Dilek Zeynep Hakkani
Design and Implementation of a Tactical Generator for Turkish, A Free
Constituent Order Language
M.S.
Thesis, July 1996 (co-supervised by Kemal Oflazer)
ABSTRACT: This thesis
describes a tactical generator for Turkish, a free constituent order language,
in which the order of the constituents may change according to the information
structure of the sentences to be generated. In the absence of any information
regarding the information structure of a sentence (i.e., topic, focus,
background, etc.), the constituents of the sentence obey a default order, but
the order is almost freely changeable, depending on the constraints of the text
flow or discourse. We have used a recursively structured finite state machine
for handling the changes in constituent order, implemented as a right-linear
grammar backbone. Our implementation environment is the GenKit
system, developed at Carnegie Mellon University--Center for Machine
Translation. Morphological realization has been implemented using an external
morphological analysis/generation component which performs concrete morpheme
selection and handles morphographemic processes. ( pdf
copy )
Turgay Korkmaz
Turkish Text Generation with Systemic-Functional Grammar
M.S.
Thesis, June 1996
ABSTRACT: Natural Language Generation
(NLG) is roughly decomposed into two stages: text planning, and text
generation. In the text planning stage, the semantic description of the text is
produced from the conceptual inputs. Then, the text generation system
transforms this semantic description into an actual text. This thesis focuses
on the design and implementation of a Turkish text generation system rather
than text planning. To develop a text generator, we need a linguistic theory
that describes the resources of the desired natural language, and also a
software tool that represents and realizes these linguistic resources in a
computational environment. In this thesis, in order to meet these requirements,
we have used a functional linguistic theory called Systemic-Functional Grammar
(SFG), and the FUF text generation system as a
software tool. The ultimate text generation system takes the semantic
description of the text sentence by sentence, and then produces a morphological
description for each lexical constituent of the sentence. The morphological
descriptions are realized as surface word forms by a Turkish morphological
generator. Because of our concentration on text generation, we have not
considered the details of text planning. Hence, we assume that the semantic
description of the text is
produced and lexicalized by an application (currently given by hand). (pdf
copy)