Theses Supervised by Ilyas Cicekli
Mustafa Cataltaş
Data Augmentation for Natural Language Processing
M.S. Thesis, June 2024
ABSTRACT: Advanced deep learning
models have greatly improved various natural language processing tasks. While
they perform best with abundant data, acquiring large datasets for each task is
not always easy. Data augmentation techniques address this by creating synthetic
samples from existing data to build more comprehensive datasets. This thesis
examines the efficacy of autoencoders as a textual data augmentation technique
for enhancing the performance of classification models in text classification
tasks. The analysis
encompasses the comparison of four distinct autoencoder types: Traditional
Autoencoder (AE), Variational Autoencoder (VAE), Adversarial Autoencoder (AAE)
and Denoising Adversarial Autoencoder (DAAE). Moreover, the study investigates
the impact of different word embedding types, preprocessing methods,
label-based filtering, and the number of training epochs on the performance of
autoencoders. Experimental evaluations are conducted using the SST-2 sentiment
classification dataset, consisting of 7791 training instances. For data
augmentation experiments, subsets of 100, 200, 400, and 1000 randomly selected
instances from this dataset were employed. Experimental evaluations involved
augmenting data at ratios of 1:1, 1:2, 1:4, and 1:8 when working with small
datasets. Comparative analysis with baseline models demonstrates the
superiority of AE-based data augmentation methods at a 1:1 augmentation ratio. These findings
underscore the effectiveness of using autoencoders as data augmentation methods
for optimizing text classification performance in NLP applications.
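As a rough illustration of the augmentation idea described in this abstract, the sketch below generates synthetic labeled samples by perturbing autoencoder latent vectors and decoding them, e.g. to reach a 1:1 augmentation ratio. The tiny architecture, noise scale, and use of pre-computed sentence embeddings are assumptions for the example, not the thesis configuration.

```python
# A minimal sketch (illustrative, not the thesis code): once an autoencoder is
# trained on sentence embeddings, synthetic samples can be produced by adding
# small noise to the latent vectors and decoding them; labels are carried over.
import torch
import torch.nn as nn

class TinySentenceAE(nn.Module):
    def __init__(self, dim=256, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def augment(model, embeddings, labels, ratio=1, noise_scale=0.05):
    """Generate `ratio` synthetic embeddings per original (e.g. ratio=1 for 1:1)."""
    model.eval()
    synthetic, synthetic_labels = [], []
    with torch.no_grad():
        for _ in range(ratio):
            _, z = model(embeddings)
            z_noisy = z + noise_scale * torch.randn_like(z)   # jitter the latent space
            synthetic.append(model.decoder(z_noisy))          # decode synthetic points
            synthetic_labels.append(labels)
    return torch.cat(synthetic), torch.cat(synthetic_labels)

# toy usage: 100 original instances, augmented at a 1:1 ratio
model = TinySentenceAE()
x, y = torch.randn(100, 256), torch.randint(0, 2, (100,))
aug_x, aug_y = augment(model, x, y, ratio=1)
```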
Seçkin Şen
Weakly-Supervised Relation Extraction
M.S. Thesis, September 2023
ABSTRACT: Relation extraction is
crucial for many natural language processing applications, such as question
answering and text summarization. There are several different approaches to
relation extraction, and most of them use the supervised learning approach,
which requires a large training dataset. These extensive datasets must be
hand-labeled by experts, making the annotation process time-consuming and
expensive. The approach utilized in this thesis is weakly supervised relation
extraction, which can reduce the cost of labeling training data. In this thesis, we propose a weakly
supervised relation extraction approach that is inspired by another weakly
supervised model named REPEL. In both REPEL and our relation extraction
approach, extraction patterns are derived from unlabeled texts using given
relation seed examples. In order to extract more useful extraction patterns, we
introduce the use of labeling functions in our method. These labeling functions
consist of simple rules that analyze the syntax of candidate patterns and help
to extract candidate patterns with higher confidence. Our
proposed method is tested on the same dataset used by REPEL in order to compare
our results with the results obtained by REPEL. Tests are conducted both in
English and Turkish. Both systems require a number of relation seed examples in
order to learn patterns from the unlabeled data.
When fewer relation seed examples are used, our method outperforms REPEL
significantly. In experimental tests, our approach generally gives better
results than REPEL for both languages. In the English tests, it is
approximately 15 times more successful than REPEL when only a few relation
seeds are used. Even with more relation seeds, our approach remains more successful.
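The sketch below illustrates the labeling-function idea from this abstract; the rules, candidate patterns, and threshold are invented for the example and are not the thesis's actual rules. Each simple syntactic rule votes on a candidate extraction pattern, and the votes are aggregated into a confidence score used to keep or discard the pattern.

```python
# Toy labeling functions voting on candidate extraction patterns (illustrative only).
import re

def lf_contains_relation_verb(pattern: str) -> int:
    return 1 if re.search(r"\b(is|was|founded|located|born)\b", pattern) else 0

def lf_not_too_long(pattern: str) -> int:
    return 1 if len(pattern.split()) <= 6 else -1

def lf_no_pronoun(pattern: str) -> int:
    return -1 if re.search(r"\b(he|she|it|they|him|her|them)\b", pattern) else 0

LABELING_FUNCTIONS = [lf_contains_relation_verb, lf_not_too_long, lf_no_pronoun]

def pattern_confidence(pattern: str) -> float:
    votes = [lf(pattern) for lf in LABELING_FUNCTIONS]
    return sum(votes) / len(votes)

candidates = ["X was born in Y", "X , who according to people close to him , visited Y"]
confident = [p for p in candidates if pattern_confidence(p) > 0]
print(confident)   # only the syntactically plausible pattern survives
```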
Mahmoud Hossein Zadeh
Behavior Analysis in Social Media
Ph.D.
Thesis, June 2023
ABSTRACT: Purpose: Social unrest is a phenomenon that occurs in all countries,
both developed and poor. The only difference is in the causes of this unrest,
which are mostly economic in underdeveloped countries. The occurrence of
protests and the role of social networks in them have always been debated
topics among researchers. Protest
Event Analysis is important for government officials and social scientists.
Here we present a new method for predicting protest events and identifying
indicators of protests and violence by monitoring the content generated on
Twitter.
Methods: By identifying these indicators, protests
and the possibility of violence can be predicted and controlled more
accurately. Twitter user behaviors such as opinion sharing and event-log
sharing are used as indicators, and this study presents a new method based on a
Bayesian logistic regression algorithm for predicting protests and violence
from these behaviors. According to the proposed method, users' event-log
sharing behavior, measured as the rate of tweets containing date and time
information, is a reliable indicator for identifying protests. Users' opinion
sharing behavior, measured as the rate of hate and anger tweets, is the best
indicator for identifying violence during protests.
Results: The research database consists of tweets generated about the BLM
(Black Lives Matter) movement after the death of George Floyd. According to
information published on acleddata.com, protests and
violence have been reported in various cities on specific dates. The dataset
contains 1414 protest events and 3078 non-protest events from 460 cities in 37
U.S. states. Protest events include 1414 protests in the BLM movement between May
28 and June 30, among which 285 were violent and 1129 were peaceful. We tested
our proposed method on this dataset: the occurrence of protests is predicted
with 85% precision, and violence in protests is likewise predicted with 85%
precision.
Conclusion: According to the research findings, the
behavior of users on the Twitter social network is a reliable source for
predicting incidents and violence. Unlike the existing literature, which
focuses on large-scale protests, this study provides a successful method for
predicting both small- and large-scale protests.
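A toy sketch of the prediction setup described in this abstract is shown below; scikit-learn's plain logistic regression is a stand-in for the thesis's Bayesian logistic regression, and the feature values and labels are invented. Each row represents a city/day described by the event-log share rate and the hate-anger opinion share rate of its tweets.

```python
# Stand-in logistic regression over invented protest-indicator features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.31, 0.12],    # [rate of tweets with date/time info, hate-anger rate]
              [0.04, 0.02],
              [0.27, 0.25],
              [0.02, 0.01]])
y = np.array([1, 0, 1, 0])     # 1 = protest reported, 0 = no protest

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0.30, 0.20]])[0, 1])   # estimated probability of a protest
```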
Kadir Yalçın
A Plagiarism Detection System Based on Document
Similarity
Ph.D.
Thesis, September 2022
ABSTRACT: It is a common problem to
find similar parts in two different documents or texts. In particular, a text
suspected of plagiarism is likely to have characteristics similar to those of
the source text. Plagiarism is defined as taking some or all of the writings of
other people and presenting them as one's own, or expressing the ideas of
others in different ways without citing the source. Plagiarism cases have been
increasing with the development of technology.
Therefore, in order to prevent plagiarism, various plagiarism detection
programs have been used in universities and principles regarding plagiarism and
scientific ethics have been added to education regulations.
In this thesis, a novel method for detecting
external plagiarism is proposed. Both syntactic and semantic similarity
features are used to identify the plagiarized parts of the text.
Part-of-speech (POS) tags are used to detect plagiarized sections in the
suspicious texts and their corresponding sections from the source texts. Each
source sentence is indexed by a search engine according to its POS tag n-grams
to access possible plagiarism candidate sentences rapidly. Suspicious sentences
converted to their POS tag n-grams are used as queries to access source
sentences. The search engine results returned for these queries enable the
detection of plagiarized parts of the suspicious document. The semantic relationship between
words is measured with the word embedding technique called Word2Vec, and the
longest common subsequence (LCS) approach is applied to calculate the semantic
similarity between the source and suspicious sentences.
In this thesis, the PAN-PC-11 dataset, which was created for the evaluation of
automatic plagiarism detection algorithms, is used. The tests are carried out
with different parameters and threshold values to evaluate the diversity of the
results. In the experiments performed on the PAN-PC-11
dataset, the proposed method achieved the best performance in low and high
obfuscation plagiarism cases compared to the plagiarism detection systems in
the 3rd International Plagiarism Detection Competition (PAN11).
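The sketch below illustrates the similarity-aware LCS idea described in this abstract; the cosine threshold and the toy vectors are assumptions, not the thesis parameters. Two words "match" when their Word2Vec-style vectors are similar enough, and the normalized LCS length serves as a sentence-level semantic similarity score.

```python
# Similarity-aware longest common subsequence over word vectors (illustrative sketch).
import numpy as np

def lcs_semantic(src_tokens, sus_tokens, vec, threshold=0.7):
    def sim(a, b):
        va, vb = vec[a], vec[b]
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    n, m = len(src_tokens), len(sus_tokens)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sim(src_tokens[i - 1], sus_tokens[j - 1]) >= threshold:
                table[i][j] = table[i - 1][j - 1] + 1       # semantic match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m] / max(n, m)                          # normalized to [0, 1]

# toy vectors stand in for Word2Vec embeddings
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=50) for w in ["the", "a", "cat", "sat", "down"]}
print(lcs_semantic(["the", "cat", "sat"], ["a", "cat", "sat", "down"], vec))
```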
Ömer Şahin
Evaluating the Use of Neural Ranking Methods in Search
Engines
M.S. Thesis, January 2022
ABSTRACT: A search engine strikes a
balance between effectiveness and efficiency to retrieve the best documents in
a scalable way. Recent deep learning-based ranker methods prove effective and
improve state of the art in relevancy metrics. However, unlike index-based
retrieval methods, neural rankers like BERT do not scale to large datasets. In
this thesis, we propose a query term weighting method that can be used with a
standard inverted index without modifying it. Using a pairwise ranking loss,
query term weights are learned using relevant and irrelevant document pairs for
each query. The learned weights prove to be more effective than term recall
values previously used for the task. We further show that these weights can be
predicted with a BERT regression model and improve the performance of both a
BM25 based index and an index already optimized with a term weighting function.
In addition, we examine document term weighting methods in the literature that
work by manipulating term frequencies or expanding documents for document
retrieval tasks. Predicting weights for document terms with the help of
contextual knowledge about the document, instead of using raw term frequencies,
significantly increases retrieval and ranking performance.
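The sketch below illustrates the pairwise-ranking idea from this abstract; the toy term frequencies, hinge loss form, and hyperparameters are illustrative, not the thesis's actual training setup. The query term weights are adjusted so that a weighted term-match score ranks a relevant document above an irrelevant one for the same query.

```python
# Learning query term weights with a toy pairwise hinge ranking loss.
import torch

def weighted_score(term_weights, doc_term_freqs):
    # doc_term_freqs[i] = frequency of query term i in the document
    return (term_weights * doc_term_freqs).sum()

def pairwise_hinge_loss(term_weights, rel_doc, irrel_doc, margin=1.0):
    s_pos = weighted_score(term_weights, rel_doc)
    s_neg = weighted_score(term_weights, irrel_doc)
    return torch.clamp(margin - (s_pos - s_neg), min=0.0)

# one toy query with 3 terms; frequencies in a relevant and an irrelevant document
weights = torch.ones(3, requires_grad=True)
relevant = torch.tensor([2.0, 0.0, 1.0])
irrelevant = torch.tensor([0.0, 3.0, 0.0])

optimizer = torch.optim.SGD([weights], lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    loss = pairwise_hinge_loss(weights, relevant, irrelevant)
    loss.backward()
    optimizer.step()
print(weights.detach())   # terms occurring in the relevant document get larger weights
```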
Anıl Özberk
Offensive Language Detection on Turkish Tweets with
BERT Models
M.S. Thesis, January 2022
ABSTRACT: As insulting statements increase on online platforms, these negative statements create reactions and disturb the peace of society. Identifying such expressions as early as possible is important to protect the victims. Offensive language detection research has been increasing in recent years. The Offensive Language Identification Dataset (OLID) was introduced to facilitate research on this topic. Examples in OLID were retrieved from Twitter and annotated manually. The Offensive Language Identification Task comprises three subtasks. In Subtask A, the goal is to classify the data as offensive or non-offensive. Data is offensive if it contains insults, threats, or profanity. Datasets in five languages, including Turkish, were offered for this task. The other two subtasks focus on the categorization of offense types (Subtask B) and targets (Subtask C). The last two subtasks mainly focus on English.
This study explores the effects of
the usage of Bidirectional Encoder Representations from Transformers (BERT)
models and fine-tuning techniques on offensive language detection on Turkish
tweets. The BERT models we use are pre-trained in Turkish. We design the
fine-tuning methods by considering the Turkish language and Twitter data. We
emphasize the importance of the pre-trained BERT model for the performance of a
downstream task. In addition, we also conduct experiments with classical
models, such as logistic regression, decision tree, random forest, and SVM.
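A compact fine-tuning sketch for the setup described above is shown below; the checkpoint name, toy tweets, labels, and hyperparameters are placeholders, not the thesis configuration. A pre-trained Turkish BERT model is fine-tuned for binary offensive/non-offensive classification.

```python
# Fine-tuning a Turkish BERT checkpoint for binary tweet classification (sketch).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dbmdz/bert-base-turkish-cased"   # a publicly available Turkish BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

tweets = ["örnek bir tweet", "başka bir tweet"]   # placeholder data
labels = torch.tensor([0, 1])                     # 0 = non-offensive, 1 = offensive
batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                # a few illustrative steps
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```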
Samira Karimi Mansoub
Selective Personalization Using Topical User Profile
to Improve Search Results
Ph.D.
Thesis, October 2020
ABSTRACT: Personalization is a technique used in Web search engines to improve
the effectiveness of information retrieval systems. There has recently been a
lot of research and application work in the field of personalized web search.
In this research, we first evaluate the effect of personalization for queries
with different characteristics. With this analysis, we investigate whether
personalization should be applied to all queries in the same way. While
personalizing some queries yields significant improvements in user experience
by providing a ranking in line with the user's preferences, it fails to improve
or even degrades the effectiveness for less ambiguous queries. A
potential-for-personalization metric can improve search engines by selectively
applying personalization.
Current methods for estimating the potential for personalization, such as click
entropy and topic entropy, are based on the documents clicked for a query or on
the query history. They have limitations, such as the unavailability of prior
click data for new and unseen queries or for queries without a history.
In this thesis, the topic entropy measure is improved by integrating the user
distribution into the metric, making it robust to the sparsity problem. This
metric estimates the potential for personalization using a topical user profile
created from user documents. In this way, we can overcome the cold start
problem when estimating the potential for new queries and increase the accuracy
of estimates for queries with a history. Since there are no previously clicked
documents for queries without a history, for such unseen queries we use the
topic distribution over the documents previously clicked by a user, i.e., the
topical user profile, instead of the documents clicked for the query.
Although the main focus of this thesis is on topic-based user profiles, since
there is little research on keyphrase-based user profiles in the
personalization process, we also carry out a comparison between keyphrase-based
and topic-based profiles. We examine how personalization can be integrated into
state-of-the-art keyphrase extraction models by considering different
supervised and unsupervised methods. We evaluate topic-based and
keyphrase-based user profiles using a re-ranking algorithm to complete the
personalization process on different datasets. In personalization with
keyphrase-based profiles, personalized models based on supervised keyphrase
extraction approaches obtained 7% higher accuracy than unsupervised approaches,
but they did not improve over topic-based models.
In topic-based models, we use a combination of personalization at the level of
user-specific and group profiling as part of the ranking process. In previous
ranking methods, the largest improvements in ranking are obtained for queries
that match the user's history. To extend this advantage to all queries, we
present a group personalized topical model (GPTM) that uses groups obtained by
clustering similar users according to their topical profiles. Experiments
reveal that the proposed potential prediction method correlates with human
query ambiguity judgments and that the group-profile-based ranking method
improves the Mean Reciprocal Rank by 8%.
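The sketch below illustrates the entropy-style measures contrasted in this abstract; the counts are invented. Click entropy is computed over the documents clicked for a query, while a topic entropy can be computed over topical user profiles, which remains available even when a query has no click history.

```python
# Click entropy versus topic entropy as "potential for personalization" signals (toy data).
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# click entropy: how spread out the clicks for one query are over documents
clicks_per_doc = Counter({"doc1": 40, "doc2": 35, "doc3": 25})
click_entropy = entropy(list(clicks_per_doc.values()))

# topic entropy: how spread out the query's users are over topical profiles
users_per_topic = Counter({"sports": 10, "politics": 9, "finance": 8})
topic_entropy = entropy(list(users_per_topic.values()))

# higher entropy suggests a more ambiguous query, hence more potential benefit
# from personalization
print(round(click_entropy, 3), round(topic_entropy, 3))
```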
Sinan Polat
Example-Based Machine Translation and Translation Memory System
M.S. Thesis, November 2016
ABSTRACT: With each passing day, communication between two people who do not
speak the same language becomes more important. In this regard, the importance
of automatic translation systems has been increasing, too. In this thesis, we
developed an example-based machine translation and translation memory system
that can translate between Turkish and English.
Example-Based Machine Translation (EBMT) is a translation technique that leans
on the machine learning paradigm. Basically, EBMT is a corpus-based approach
that utilizes the translation-by-analogy concept. In this sense, according to
our approach, the
translation templates between two languages are inferred from similarities and
differences of the given translation examples by using machine learning
techniques.
These
inferred translation templates are used in the translation of other texts. The similarities and the differences between
English sentences of two translation examples must correspond to the
similarities and the differences between Turkish sentences of those translation
examples. By using this information, the
translation templates are inferred from the given translation examples.
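The sketch below gives a rough illustration of the template-inference idea described above, for a simplified single-variable case with invented example pairs; the actual learning algorithm in the thesis is more general. The parts shared by two translation examples become the template, and the differing parts become a variable whose fillers are remembered as word-level correspondences.

```python
# Inducing a translation template from two example pairs (illustrative sketch).
def split_common(a, b):
    """Return (prefix, a_diff, b_diff, suffix) for token lists a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    return a[:i], a[i:len(a) - j], b[i:len(b) - j], a[len(a) - j:]

def induce_template(src1, src2, tgt1, tgt2):
    sp, sd1, sd2, ss = split_common(src1.split(), src2.split())
    tp, td1, td2, ts = split_common(tgt1.split(), tgt2.split())
    template = (" ".join(sp + ["X"] + ss), " ".join(tp + ["X"] + ts))
    fillers = {" ".join(sd1): " ".join(td1), " ".join(sd2): " ".join(td2)}
    return template, fillers

template, fillers = induce_template("I read a book", "I read a letter",
                                    "ben bir kitap okudum", "ben bir mektup okudum")
print(template)   # ('I read a X', 'ben bir X okudum')
print(fillers)    # {'book': 'kitap', 'letter': 'mektup'}
```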
In addition, helper programs known as translation memory systems are used
during translation. A translation memory is a storage environment that keeps
previously translated sentences or phrases. When translating with a translation
memory system, the system retrieves the translation examples most similar to
the sentence that the translator wants to translate. With the help of the
retrieved translation examples, the translation of a new sentence can be
achieved more quickly.
In this thesis, in addition to developing a complete example-based machine
translation system, we aimed to develop a translation memory system and to
merge these two systems. We also give importance to the scalability of the
systems with respect to the size of the datasets they use.
Merve Selçuk Şimşek
Example Based Machine Translation between Turkish and Turkish Sign
Language
M.S. Thesis, July, 2016
ABSTRACT: Communication is one of the primary necessities for humankind to live
and survive. There have been many different ways to communicate for centuries,
yet there are mainly three in today's world: spoken, written, and sign
languages.
According to research on the language usage of deaf people, they commonly
prefer sign language over the other ways. They differ in many aspects from
people who use a spoken language. We live in a world in which being different
may be hardly welcomed, and, additionally, they are a minority. Most of the
time they need helpers and/or interpreters in daily life and are accompanied by
human helpers. We intend to build a machine translation system between Turkish
and Turkish Sign Language (TSL) in the belief that one day this novel work on
Turkish will help these people to live independently.
We prefer an example-based machine translation (EBMT) approach, which is a
corpus-based approach, and intend to create a bidirectional dynamic machine
translation system between spoken and sign languages. We believe this approach
makes our study a unique work for Turkish. For TSL notation we use glosses, and
our study covers a bidirectional machine translation system between written
Turkish and TSL glosses.
Gulshat Kessikbayeva
Example Based Machine Translation System between Kazakh and Turkish
Supported by Statistical Language Model
Ph.D. Thesis, May, 2016
ABSTRACT: Example-Based Machine Translation (EBMT) is an analogy-based type of
Machine Translation (MT), where translation is made according to an aligned
bilingual corpus. Moreover, there are many different methodologies in MT, and
hybridization between these methods is also possible; it focuses on combining
the strongest sides of more than one MT approach to provide better translation
quality. Hybrid Machine Translation (HMT) has two parts: the guided part and
the information part.
Our work is guided by EBMT, and a hybrid example-based machine translation
system between the Kazakh and Turkish languages is presented here. Analyzing
both languages at the morphological level and then constructing morphological
processors is one of the most important parts of the system. These
morphological processors are used to obtain the lexical forms of
the surface level words and the surface level forms of translation results at
lexical level. Translation templates are kept at lexical level and they
translate a given source language sentence at lexical level to a target
language sentence at lexical level. Our bilingual corpora hold translation
examples at surface level and their words are morphologically analyzed by
appropriate morphological analyzer before they are fed into the learning
module. Thus, translation templates are learned at morphological level from a
bilingual parallel corpus between Turkish and Kazakh. Translations can be
performed at both directions using these learned translation templates.
The system is supported by a statistical language
model for the target language. Therefore, translation results are sorted
according to both their confidence factors that are computed using the
confidence factors of the translation templates used in those translations and
statistical language model probabilities of those translation results. Thus,
the statistical language model of the target language is used in the ordering
of translation results, in addition to translation template confidence factors,
in order to obtain more precise translation results.
Our main aim with our hybrid example-based machine translation system is to
obtain more accurate translation results by using knowledge gained in advance
from target-language resources. One of the reasons we propose this hybrid
approach is that monolingual language resources are more widely available than
bilingual language resources. In this thesis, experiments show that we can rely
on the combination of the EBMT and SMT approaches, because it produces
satisfying results.
Nicat Süleymanov
Developing A Platform That Supplies Processed Information from Internet
Resources and Services
M.S. Thesis, September, 2015
ABSTRACT: Ever-increasing information resources make it harder to reach the
needed piece of information; users do not want 10 billion results from search
engines, but prefer the 10 best-matched answers, and, if it exists, they prefer
the single right answer. In this research, we present a
Turkish question answering system that extracts the most suitable answer from
internet services and resources. During question analysis, the question class
is determined, certain expressions are predicted from the lexical and
morphological properties of the words in the question, and our two-stage
solution approach tries to get the answer. Furthermore, to increase the success
rate of the system, the WordNet platform is used.
In the information retrieval process, the system works over documents using
semantic web information instead of documents retrieved by a classic search
engine. In order to reach the needed information easily among ever-increasing
resources, Tim Berners-Lee's idea of the semantic web is used in this research.
DBpedia extracts structured information from Wikipedia articles and makes this
structured information accessible on the web. In our research, the
subject-predicate-object triples matched with the asked question are formulated
to get the answer in Turkish; the Wikipedia Search API and the Bing Translate
API are used for searching and obtaining the Turkish equivalent of the
information.
Farhad Soleimanian Gharehchopogh
Open Domain Factoid Question Answering System
Ph.D. Thesis, September, 2015
ABSTRACT: Question Answering (QA) is a field of Artificial Intelligence (AI),
Information Retrieval (IR), and Natural Language Processing (NLP), and it leads
to systems that automatically answer natural language questions in open and
closed domains. Question Answering Systems (QASs) have to deal with different
types of user questions. While answers for some simple questions can be short
phrases, answers for some more complex questions can be short texts. A question
with a single correct answer is known as a factoid question, and a question
answering system that deals with factoid questions is called a factoid QAS.
In this thesis, we present a factoid QAS that consists of three phases:
question processing, document/passage retrieval, and answer processing. In the
question processing phase, we consider a new two-level category structure and
use machine learning techniques to generate search engine queries from user
questions. Our factoid QAS uses the World Wide Web (WWW) as its corpus of texts
and knowledge base in the document/passage retrieval phase. Also, it is a
pattern-based QAS that uses an answer pattern matching technique in the answer
processing phase.
We also present a classification of existing QASs. The classification contains
early QASs, rule-based QASs, pattern-based QASs, NLP-based QASs, and machine
learning based QASs. Our factoid QAS uses a two-level category structure that
includes 17 coarse-grained and 57 fine-grained categories, and it utilizes this
category structure in order to extract answers to questions. The training
dataset consists of 570 questions originating from the TREC-8 and TREC-9
questions, and 570 other questions together with the TREC-8, TREC-9, and
TREC-10 questions are used as testing datasets.
In our QAS, the query expansion step is very important and it affects the
overall performance of our QAS. When an original user question is given as a
query, the number of retrieved relevant documents may not be enough. We present
an automatic query expansion approach based on query templates and question
types. New queries are generated from query templates of question categories
and the category of a user question is found by a Naïve Bayes classification
algorithm. New expanded queries are generated by filling gaps in query
templates with two appropriate phrases. The first phrase is the question type
phrase and it is found directly by the classification algorithm. The second phrase
is the question phrase and it is detected from possible question templates by a
Levenshtein distance algorithm.
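The sketch below is a toy version of this query-expansion step; the templates, categories, and example question are invented, and the Naive Bayes category classifier of the thesis is replaced here by a pure edit-distance pick, so it only illustrates the template-filling and Levenshtein-matching parts.

```python
# Toy template-based query expansion with a Levenshtein-distance template match.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# hypothetical canonical question templates and query templates per category
question_templates = {"capital": "what is the capital of X",
                      "birth_date": "when was X born"}
query_templates = {"capital": ["capital of {q}", "{q} capital city"],
                   "birth_date": ["{q} was born in", "{q} date of birth"]}

def expand(question: str, question_phrase: str) -> list:
    category = min(question_templates,
                   key=lambda c: levenshtein(question, question_templates[c]))
    return [t.format(q=question_phrase) for t in query_templates[category]]

print(expand("what is the capital of Turkey", "Turkey"))
# ['capital of Turkey', 'Turkey capital city']
```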
Query templates for question types are created by analyzing possible questions
of those question types. We evaluated our query expansion approach with the
two-level category structure and the factoid question types included in the
TREC-8, TREC-9, and TREC-10 conference datasets. The results of our automatic
query expansion approach outperform those of the manual query expansion
approach.
After automatically learning answer patterns by querying the web, we use answer
pattern sets for each question type. Answer patterns extract answers from
retrieved related text segments, and answer patterns can be generalized with
Named Entity Recognition (NER). NER is a sub-task of Information Extraction
(IE) used in the answer processing phase; it classifies terms in textual
documents into predefined categories of interest such as location names, person
names, and event dates. The ranking of answers is based on frequency counting
and the Confidence Factor (CF) values of the answer patterns.
The results of the system show that our approach is effective for question
answering: it achieves a Mean Reciprocal Rank (MRR) of 0.58 for the
fine-grained category structure on our corpus, 0.62 MRR for the coarse-grained
category structure, and 0.55 MRR in the evaluation on the TREC-10 testing
dataset. The results of the system have been compared with those of other QASs
using standard measurements on the TREC datasets.
Servet Taşcı
Content Based Media Tracking and News Recommendation System
M.S. Thesis, June, 2015
ABSTRACT: With the increasing use of the Internet in our lives, the amount of
unstructured data, and particularly the amount of textual data, has increased
dramatically. Since the access point of users to this data is the Internet, the
reliability and accuracy of these resources stand out as a concern. Besides the
multitude of resources, most resources have similar content, and it is quite
challenging to read only the needed news among these resources in a short time.
It is also necessary that the accessed resource really includes the required
information and that it is confirmed by the user. Recommender systems assess
different characteristics of users, correlate the accessed content with the
user, evaluate the content according to specific criteria, and then recommend
it to the user. The first recommender systems used simple content filtering
features, but current systems use much more complicated calculations and
algorithms and try to correlate many characteristics of the users and the data.
These improvements have allowed recommender systems to be used as decision
support systems.
This thesis aims at collecting data from textual news resources, classifying
and summarizing the data, and recommending the news by correlating it with the
characteristics of users. Recommender systems mainly use three methods:
content-based filtering, collaborative filtering, and hybrid filtering. In our
system, content-based filtering is used.
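The sketch below illustrates content-based filtering in its simplest form; the articles and the user's reading history are invented, and the thesis system is considerably richer. Articles are represented as TF-IDF vectors, the user profile is the mean vector of previously read articles, and the most similar unread article is recommended.

```python
# Minimal content-based news recommendation with TF-IDF vectors (toy data).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "central bank raises interest rates to fight inflation",
    "the national football team wins the championship final",
    "new inflation figures surprise financial markets",
    "star striker transfers to a rival football club",
]
read_by_user = [0]   # the user has read the first article

doc_vectors = TfidfVectorizer().fit_transform(articles).toarray()
profile = doc_vectors[read_by_user].mean(axis=0, keepdims=True)   # user profile vector
scores = cosine_similarity(profile, doc_vectors)[0]

unread = [i for i in range(len(articles)) if i not in read_by_user]
recommended = max(unread, key=lambda i: scores[i])
print(articles[recommended])   # the economics article, closest to the user's history
```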
Gönenç Ercan
Lexical Cohesion Analysis for Topic Segmentation, Summarization and Keyphrase Extraction
Ph.D. Thesis, December, 2012
ABSTRACT: When we express an idea or a story, it is inevitable that we use
words that are semantically related to each other. When this phenomenon is
exploited from the aspect of the words in the language, it is possible to infer
the level of semantic relationship between words by observing their
distribution and use in discourse. From the aspect of discourse, it is
possible to model the structure of the document by observing the changes in the
lexical cohesion in order to attack high level natural language processing
tasks. In this research lexical cohesion is investigated from both of these
aspects by first building methods for measuring semantic relatedness of word pairs
and then using these methods in the tasks of topic segmentation, summarization
and keyphrase extraction.
Measuring
semantic relatedness of words requires prior knowledge about the words. Two
different knowledge-bases are investigated in this research. The first
knowledge base is a manually built network of semantic relationships, while the
second relies on the distributional patterns in raw text corpora. In order to
discover which method is effective in lexical cohesion analysis, a
comprehensive comparison of state-of-the-art methods in semantic relatedness is
made.
For
topic segmentation different methods using some form of lexical cohesion are
present in the literature. While some of these confine the relationships only
to word repetition or strong semantic relationships like synonymy, no other
work uses the semantic relatedness measures that can be calculated for any two
word pairs in the vocabulary. Our experiments suggest that using semantic
relatedness improves topic segmentation performance over methods that use only
classical relationships and word repetition. Furthermore, the experiments
compare the performance of different semantic relatedness methods in a
high-level task. The detected topic segments are used in summarization and
achieve better results compared to a lexical-chain-based method that uses
WordNet.
Finally,
the use of lexical cohesion analysis in keyphrase
extraction is investigated. Previous research shows that keyphrases
are useful tools in document retrieval and navigation. While these point to a
relation between keyphrases and document retrieval
performance, no other work uses this relationship to identify keyphrases of a given document. We aim to establish a link
between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are
evaluated in keyphrase extraction using a Naive Bayes
classifier. Our experiments indicate that these features improve the
effectiveness of keyphrase extraction in documents of different lengths. More
importantly, the commonly used features of frequency and first position in the
text perform poorly on shorter documents, whereas QPP features are more robust
and achieve better results.
Serhan Tatar
Automating Information Extraction Task For Turkish Texts
Ph.D. Thesis, January, 2011
ABSTRACT: Throughout history, mankind has often
suffered from a lack of necessary resources. In today's information world, the
challenge can sometimes be a wealth of resources. That is to say, an excessive
amount of information implies the need to find and extract necessary
information. Information extraction can be defined as the identification of selected types of entities,
relations, facts or events in a set of unstructured text documents in a natural
language.
The
goal of our research is to build a system that automatically locates and
extracts information from Turkish unstructured texts. Our study focuses on two
basic IE tasks: Named Entity Recognition and Entity Relation Detection. Named
Entity Recognition, finding named entities (persons, locations, organizations,
etc.) located in unstructured texts, is one of the most fundamental IE tasks.
Entity Relation Detection task tries to identify relationships between entities
mentioned in text documents.
Using a supervised learning strategy, the developed system starts with a set of
examples collected from a training dataset and generates the extraction rules
from the given examples by using a carefully designed coverage algorithm.
Moreover, several rule filtering and rule refinement techniques are utilized to
maximize generalization and accuracy at the same time. In order to obtain accurate
generalization, we use several syntactic and semantic features of the text,
including orthographic, contextual, lexical, and morphological features. In
particular, morphological features of the text are effectively used in this
study to increase the extraction performance for Turkish, an agglutinative
language. Since the system does not rely on handcrafted rules/patterns, it does
not heavily suffer from the domain adaptability problem.
The results
of the conducted experiments show that (1) the developed systems are
successfully applicable to the Named Entity Recognition and Entity Relation
Detection tasks, and (2) exploiting morphological features can significantly
improve the performance of information extraction from Turkish, an
agglutinative language.
Ergin Soysal
Ontology Based Information Extraction on Free Text Radiological Reports
Using Natural Language Processing Approach
Ph.D. Thesis, September, 2010 (co-supervised by Nazife
Baykal)
ABSTRACT: This thesis describes an information extraction system
that is designed to process free text Turkish radiology reports in order to
extract and convert the available information into a structured information
model. The system uses natural language processing techniques together with
domain ontology in order to transform the verbal descriptions into a target
information model, so that they can be used for computational purposes. The
developed domain ontology is effectively used in entity recognition and
relation extraction phases of the information extraction task. The ontology
provides the flexibility in the design of extraction rules, and the structure
of the ontology also determines the information model that describes the
structure of the extracted semantic information. In addition, some of the
missing terms in the sentences are identified with the help of the ontology.
One of the main contributions of this thesis is the usage of ontology in
information extraction that increases the expressive power of extraction rules
and helps to determine missing items in the sentences. The system is the first
information extraction system for Turkish texts. Since Turkish is a
morphologically rich language, the system uses a morphological analyzer and the
extraction rules are also based on the morphological features. TRIES achieved
93% recall and 98% precision results in the performance evaluations.
Mücahid Kutlu
Noun Phrase Chunker For Turkish Using
Dependency Parser
M.S. Thesis, July, 2010
ABSTRACT: Noun phrase chunking is a sub-category of shallow
parsing that can be used for many natural language processing tasks. In this
thesis, we propose a noun phrase chunker system for
Turkish texts. We use a weighted constraint dependency parser to represent the
relationship between sentence components and to determine noun phrases.
The
dependency parser uses a set of hand-crafted rules which can combine
morphological and semantic information for constraints. The rules are suitable
for handling complex noun phrase structures because of their flexibility. The
developed dependency parser can be easily used for shallow parsing of all
phrase types by changing the employed rule set.
The
lack of reliable human tagged datasets is a significant problem for natural
language studies about Turkish. Therefore, we constructed the first noun phrase
dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset.
The
correct morphological disambiguation of words is required for the correctness
of the dependency parser. Therefore, in this thesis, we propose a hybrid
morphological disambiguation technique which combines statistical information,
hand-crafted grammar rules, and transformation based learning rules. We have also
constructed a dataset for testing the performance of our disambiguation system.
According to tests, the disambiguation system is highly effective.
Filiz Alaca Aygül
Natural Language Query Processing In Ontology
Based Multimedia Databases
M.S. Thesis, April, 2010 (co-supervised by Nihan
Cicekli)
ABSTRACT: In this thesis a natural language query interface is
developed for semantic and spatio-temporal querying
of MPEG-7 based domain ontologies. The underlying ontology is created by
attaching domain ontologies to the core Rhizomik
MPEG-7 ontology. The user can pose concept, complex concept (objects connected
with an AND or OR connector), spatial (left, right, ...), temporal (before,
after, at least 10 minutes before, 5 minutes after, ...), object trajectory,
and directional trajectory (east, west, southeast, ..., left, right, upwards,
...) queries to the system. Furthermore, the system handles
the negative meaning in the user input. When the user enters a natural language
(NL) input, it is parsed with the link parser. According to query type, the
objects, attributes, spatial relation, temporal relation, trajectory relation,
time filter and time information are extracted from the parser output by using
predefined rules. After the information extraction, SPARQL queries are
generated, and executed against the ontology by using an RDF API. Results are
retrieved and they are used to calculate spatial, temporal, and trajectory
relations between objects. The results satisfying the required relations are
displayed in a tabular format and user can navigate through the multimedia
content.
Kezban Demirtaş
Automatic Video Categorization And Summarization
M.S. Thesis, September, 2009 (co-supervised by Nihan
Cicekli)
ABSTRACT: Today people have access to a large amount of video, and finding a
video of interest has become a difficult and time-consuming job. It can be
infeasible for a human to go through all available videos to find the video of
interest. Automatically categorizing videos and presenting a semantic summary
of a video could provide people a significant advantage in this regard. In this
thesis, we perform automatic video categorization and summarization by using
the subtitles of videos.
We propose two methods for video categorization. The first method performs
unsupervised categorization by applying natural language processing techniques
to video subtitles and uses the WordNet lexical database and the WordNet
domains. The method starts with text preprocessing. Then a keyword extraction
algorithm and a word sense disambiguation method are applied. The WordNet
domains that correspond to the correct senses of the keywords are extracted.
The video is assigned a category label based on the extracted domains. The
second method has the same steps for extracting the WordNet domains of a video
but performs categorization by using a learning module. Experiments with
documentary videos give promising results in
discovering the correct categories of videos.
Video
summarization algorithms present condensed versions of a full-length video by
identifying the most significant parts of the video. We propose a video
summarization method using the subtitles of videos and text summarization
techniques. We identify the significant sentences of the subtitles of a video
by using text summarization techniques, and then we compose a video summary by
finding the video parts corresponding to these summary sentences.
Nagehan Pala Er
Turkish Factoid Question Answering Using Answer Pattern Matching
M.S. Thesis, July, 2009
ABSTRACT: Efficiently locating information on the Web has become
one of the most important challenges in the last decade. The Web Search Engines
have been used to locate the documents containing the required information.
However, in many situations a user wants a particular piece of information
rather than a document set. Question Answering (QA) systems have addressed this
problem, and they return explicit answers to questions rather than a set of
documents. Questions addressed by QA systems can be categorized into five
categories: factoid, list, definition, complex, and speculative questions. A
factoid question has exactly one correct answer, and the answer is mostly a
named entity like person, date, or location. In this thesis, we develop a
pattern matching approach for a Turkish Factoid QA system. In the TREC-10 QA
track, most of the question answering systems used sophisticated linguistic
tools. However, the best performing system at the track used only an extensive
list of surface patterns; therefore, we decided to investigate the potential of
the answer pattern matching approach for our Turkish Factoid QA system. We try different
methods for answer pattern extraction such as stemming and named entity
tagging. We also investigate query expansion by using answer patterns. Several
experiments have been performed to evaluate the performance of the system.
Compared with the results of the other factoid QA systems, our methods have
achieved good results. The results of the experiments show that named entity
tagging improves the performance of the system.
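A toy illustration of answer pattern matching for a factoid question is shown below; the patterns, question type, and text snippets are invented, whereas the thesis learns its patterns automatically from the web and targets Turkish questions.

```python
# Extracting factoid answers with surface answer patterns (illustrative sketch).
import re

# hypothetical learned patterns for the question type "birth year of <PERSON>"
patterns = [
    r"{person} was born in (\d{{4}})",
    r"{person} \((\d{{4}})",
]

def extract_answers(person, snippets):
    answers = []
    for snippet in snippets:
        for p in patterns:
            match = re.search(p.format(person=re.escape(person)), snippet)
            if match:
                answers.append(match.group(1))
    return answers

snippets = ["Ataturk (1881-1938) founded the republic.",
            "Mustafa Kemal Ataturk was born in 1881 in Salonika."]
print(extract_answers("Ataturk", snippets))   # ['1881', '1881']
# frequency counting over such candidates is then used to rank the final answer
```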
Özkan Göktürk
Metadata Extraction From Text In Soccer Domain
M.S. Thesis, September, 2008 (co-supervised by Nihan
Cicekli)
ABSTRACT: Video databases and content based retrieval in these
databases have become popular with the improvements in technology. Metadata
extraction techniques are used for providing data to video content. One popular
metadata extraction technique for multimedia is information extraction from
text. For some domains, such as the soccer, movie, and news domains, it is
possible to find text accompanying the video. In this thesis, we present an
approach for metadata extraction from match reports in the soccer domain. The
UEFA Cup and UEFA Champions League match reports are downloaded
from the web site of UEFA by a web-crawler. These match reports are
preprocessed by using regular expressions and then important events are
extracted by using hand-written rules. In addition to hand-written rules, two
different machine learning techniques are applied on match corpus to learn
event patterns and automatically extract match events. Extracted events are
saved in an MPEG-7 file. A user interface is implemented to query the events in
the MPEG-7 match corpus and view the corresponding video segments.
Turhan Osman Daybelge
Improving The Precision Of Example-Based Machine
Translation By Learning From User Feedback
M.S. Thesis, September 2007
ABSTRACT: Example-Based Machine Translation (EBMT) is a corpus
based approach to Machine Translation (MT) that utilizes the translation by
analogy concept. In our EBMT system, translation templates are extracted
automatically from bilingual aligned corpora, by substituting the similarities
and differences in pairs of translation examples with variables. As this
process is done on the lexical-level forms of the translation examples, and
words in natural language texts are often morphologically ambiguous, a need for
morphological disambiguation arises. Therefore, we present here a rule-based
morphological disambiguator for Turkish. In earlier
versions of the discussed system, the translation results were solely ranked
using confidence factors of the translation templates. In this study, however,
we introduce an improved ranking mechanism that dynamically learns from user
feedback. When a user, such as a professional human translator, submits his
evaluation of the generated translation results, the system learns
context-dependent co-occurrence rules from this feedback. The newly learned
rules are later consulted, while ranking the results of the following
translations. Through successive translation-evaluation cycles, we expect that
the output of the ranking mechanism complies better with user expectations,
listing the more preferred results in higher ranks. The evaluation of our
ranking method, using the precision value at top 1, 3 and 5 results and the
BLEU metric, is also presented.
Hande Doğan
Example Based Machine Translation with Type Associated Translation
Examples
M.S. Thesis, January 2007
ABSTRACT: Example-based machine translation is a translation technique that
leans on the machine learning paradigm. This technique has been modeled on the
following learning process: a person is given short and simple sentences in
language A with their correspondences in language B; the person memorizes these
pairs and then becomes able to translate new sentences via these pairs in
memory. In our system, the translation pairs are kept as translation templates.
A translation template is induced from two given translation examples by
replacing differing parts in these examples by variables. A variable replacing
a difference that consists of two differing parts (one from the first example,
and the other one from the second example) is a generalization of those two
differing parts, and these variables are supported with part-of-speech tag
information in order to reduce incorrect translations. After the learning
phase, translation is achieved by finding the appropriate template(s) and
replacing the variables. ( pdf
copy )
Yasin Uzun
Induction Of Logical Relations Based On Specific
Generalization Of Strings
M.S. Thesis, January 2007
ABSTRACT: Learning logical relations from examples expressed as first-order
facts has been studied extensively in Inductive Logic Programming research.
Learning with positive-only data may cause overgeneralization of examples,
leading to inconsistent resulting hypotheses. A learning heuristic that infers
the specific generalization of strings based on unique match sequences is shown
to be capable of learning predicates with string arguments. This thesis
outlines the effort to build an inductive learner based on the idea of the
specific generalization of strings that generalizes given clauses, considering
the background knowledge, using a least general generalization schema. The
system is also extended to generalize predicates having numeric arguments and
is shown to be capable of learning concepts such as family relations, grammar
learning, and predicting mutagenicity using numeric data. ( pdf
copy )
Gönenç Ercan
Automated Text Summarization And Keyphrase Extraction
M.S.
Thesis, September 2006
ABSTRACT: As the number of electronic documents increases rapidly, the need for
faster techniques to assess the relevance of documents emerges. A summary can
be considered a concise representation of the underlying text. To form an ideal
summary, a full
understanding of the document is essential. For computers, full understanding
is difficult, if not impossible. Thus, selecting important sentences from the
original text and presenting these sentences as a summary is a common technique
in automated text summarization research.
The lexical cohesion structure of the text can be
exploited to determine the importance of a sentence/phrase. Lexical chains are
useful tools to analyze the lexical cohesion structure in a text. This thesis
discusses our research on automated text summarization and keyphrase
extraction using lexical chains. We investigate the effect of the use of
lexical cohesion features in keyphrase extraction,
with a supervised machine learning algorithm. Our summarization algorithm
constructs the lexical chains, detects topics roughly from lexical chains,
segments the text with respect to the topics and selects the most important
sentences. Our experiments show that lexical cohesion based features improve keyphrase extraction. Our summarization algorithm has
achieved good results, compared to some other lexical cohesion based
algorithms. ( pdf
copy )
Özlem İstek
A Link Grammar For Turkish
M.S.
Thesis, August 2006
ABSTRACT: Syntactic
parsing, or syntactic analysis, is the process of analyzing an input sequence
in order to determine its grammatical structure, i.e. the formal relationships
between the words of a sentence, with respect to a given grammar. In this
thesis, we developed the grammar of the Turkish language in the link grammar
formalism. In the grammar, we used the output of a fully described
morphological analyzer, which is very important for agglutinative languages
like Turkish. The grammar that we developed is lexical in the sense that we
used the lexemes of only some function words, and for the rest of the word
classes we used morphological feature structures. In addition, we preserved
some of the syntactic roles of the intermediate derived forms of words in our
system. ( pdf
copy )
Barış Eker
Turkish Text to Speech System
M.S.
Thesis, April 2002
ABSTRACT: Scientists have been interested in producing
human speech artificially for more than two centuries. After the invention of
computers, computers were used to synthesize speech. With the help of this new
technology, Text-To-Speech (TTS) systems, which take a text as input and
produce speech as output, started to be created. Some languages like English
and French have received most of the attention, while languages like Turkish
have not been taken into consideration.
This thesis presents a TTS system for Turkish that uses the diphone
concatenation method. It takes a text as input and produces the corresponding
speech in Turkish. The output can be obtained in only one male voice in this
system. Since Turkish is a phonetic language, this system can also be used for
other phonetic languages with some minor modifications. If this system is
integrated with a pronunciation unit, it can
also be used for languages that are not phonetic. ( pdf
copy )
Göker Canıtezer
Generalization of Predicates with String Arguments
M.S.
Thesis, January 2002
ABSTRACT: String/sequence generalization is
used in many different areas such as machine learning, example-based machine
translation and DNA sequence alignment. In this thesis, a method is proposed to
find the generalizations of the predicates with string arguments from the given
examples. Trying to learn from examples is a very hard problem in machine
learning, since finding the globally optimal point at which to stop
generalization is a difficult and time-consuming process. All the work done
until now employs a heuristic to find the best solution, and this work is one
of them. In this study, some restrictions applied by the SLGG (Specific Least
General Generalization) algorithm, which was developed to be used in an
example-based machine translation system, are relaxed to find all possible
alignments of two strings. Moreover, a Euclidean-distance-like scoring
mechanism is used to
find the most specific generalizations. Some of the generated templates are
eliminated by four different selection/filtering approaches to get a good
solution set. Finally, the result set is presented as a decision list, which
provides the handling of exceptional cases. ( pdf
copy )
Kemal Altıntaş
Turkish to Crimean Tatar Machine Translation System
M.S.
Thesis, July 2001
ABSTRACT: Machine translation has always been interesting to people
since the invention of computers. Most of the research has been conducted on
western languages such as English and French, and Turkish and Turkic languages
have been left out of the scene. Machine translation between closely related
languages is easier than between language pairs that are not related with each
other. Having many parts of their grammars and vocabularies in common reduces
the amount of effort needed to develop a translation system between related
languages. A translation system that makes a morphological analysis supported
by simpler translation rules and context dependent bilingual dictionaries would
suffice most of the time. Usually a semantic analysis may not be needed.
This thesis presents a machine translation system from
Turkish to Crimean Tatar that uses finite state techniques for the translation
process. By developing a machine translation system between Turkish and Crimean
Tatar, we propose a sample model for translation between close pairs of
languages. The system we developed takes a Turkish sentence, analyses all the
words morphologically, translates the grammatical and context dependent
structures, translates the root words and finally morphologically generates the
Crimean Tatar text. Most of the time, at least one of the outputs is a true
translation of the input sentence. ( pdf
copy )
Atacan Çundoroğlu
Error Tolerant Finite State Parsing for a Turkish Dialogue System
M.S.
Thesis, July 2001
ABSTRACT: In NLP (Natural Language Processing), high level
grammar formalisms are frequently employed for parsing. Since in practice no
formalism can cope with the diversity and the flexibility of the human
languages, such formalisms are used in closed domains, with sub-languages. Even
though we believe that in an open world sophisticated analysis is required for
extracting meaning from natural language texts, this does not have to be the
case for the closed domains. Simpler time-efficient finite state methods can be
used in closed domains. With their simplicity and time-efficiency, finite state
methods are not only responsive, but also easy to augment with error tolerance
which allows these methods to flexibly parse mildly ungrammatical sentences. In
this thesis, we present a parser module which is based on error tolerant finite
state recognition and a grammar for parsing transcribed dialogue utterances in
a closed Turkish banking domain. Test results on synthetically created
erroneous sentences indicate that the proposed system can analyze
ungrammatical sentences efficiently and can scale with the growth of the
grammar. ( postscript
copy )
Umut Topkara
Prefix-Suffix Based Statistical Language Model for Turkish
M.S.
Thesis, July 2001
ABSTRACT: As a large amount of online text became available, concisely
representing quantitative
information about language and doing inference on this information for natural
language applications have become an attractive research area. Statistical
language models try to estimate the unknown probability distribution P(u) that
is assumed to have produced large text corpora of linguistic units u. This
probability distribution estimate is used to improve the performance of many
natural language processing applications including speech recognition (ASR),
optical character recognition (OCR), spelling and grammar correction, machine
translation and document classification. Statistical language modeling has been
successfully applied to English. However, this good performance of approaches
to statistical modeling of English does not apply to Turkish. Turkish has a
productive agglutinative morphology; that is, it is possible to derive
thousands of word forms from a given root word by adding suffixes. When
statistical modeling by word units is used, this productive vocabulary
structure causes data sparseness problems in general and serious space problems
in time- and memory-critical applications such as speech recognition.
According to a recent Ph.D.
thesis by Hakkani-Tur, using fixed-size prefix and suffix parts of words for
the statistical modeling of Turkish performs better than using whole words for
the task of selecting the most likely sequence of words from a list of
candidate words emitted by a speech recognizer. Following these successful
results, we have conducted further research on using smaller units for the
statistical modeling of Turkish. We have used a fixed number of syllables for
the prefix and suffix parts. In our experiments, we have used a small
vocabulary of prefixes and suffixes to test the robustness of our approach. We
also compared the performance of prefix-suffix language models having a 2-word
context with word 2-gram models. We have found a language model that uses
subword units and can perform as well as a large word-based language model in a
2-word context while still being half its size. ( postscript
copy )
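The sketch below illustrates the general idea of modeling Turkish with subword units instead of whole words; the 3-character prefix split, toy corpus, and add-one smoothing are assumptions for the example, not the thesis's syllable-based units or estimation method.

```python
# A toy bigram language model over prefix/suffix subword units.
from collections import Counter

def split_word(word, prefix_len=3):
    prefix = word[:prefix_len]
    suffix = word[prefix_len:] or "<nosuf>"
    return [prefix, suffix]

def units(sentence):
    out = []
    for w in sentence.split():
        out.extend(split_word(w))
    return out

corpus = ["evden geldim", "evdeki kitap", "okula gittim"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + units(sent)
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def bigram_prob(prev, cur, alpha=1.0):
    """Add-one smoothed P(cur | prev) over the subword vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)

print(bigram_prob("evd", "en"))   # probability of the suffix unit "en" after prefix "evd"
```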
Ayse Pınar Saygın
Turing Test and Conversation
M.S.
Thesis, July 1999
ABSTRACT: The Turing Test is one of
the most disputed topics in Artificial Intelligence, Philosophy of Mind and
Cognitive Science. It was proposed 50 years ago as a method to determine
whether machines can think or not. It embodies important philosophical issues,
as well as computational ones. Moreover, because of its characteristics, it
requires interdisciplinary attention. The Turing Test posits that, to be
granted intelligence, a computer should imitate human conversational behavior so
well that it should be indistinguishable from a real human being. From this, it
follows that conversation is a crucial concept in its study. Surprisingly,
focusing on conversation in relation to the Turing Test has not been a
prevailing approach in previous research. This thesis first provides a thorough
and deep review of the 50 years of the Turing Test. Philosophical arguments,
computational concerns, and repercussions in other disciplines are all
discussed. Furthermore, this thesis studies the Turing Test as a special kind
of conversation. In doing so, the relationship between existing theories of
conversation and human-computer communication is explored. In particular,
Grice's cooperative principle and conversational maxims are concentrated on.
Viewing the Turing Test as conversation and computers as language users has
significant effects on the way we look at Artificial Intelligence and on
communication in general. ( postscript
copy )
Zeynep Orhan
Confidence Factor Assignment to Translation Templates
M.S.
Thesis, September 1998
ABSTRACT: The TTL (Translation Template Learner) algorithm learns lexical-level
correspondences between two translation examples by using analogical reasoning.
The sentences used as
translation examples have similar and different parts in the source language
which must correspond to the similar and different parts in the target
language. Therefore, these correspondences are learned as translation
templates. The learned translation templates are used in the translation of
other sentences. However, we need to assign confidence factors to these
translation templates to order translation results with respect to previously
assigned confidence factors. This thesis proposes a method for assigning
confidence factors to translation templates learned by the TTL algorithm.
In this process, each template is assigned a confidence factor according to the
statistical information obtained from the training data. Furthermore, some
template combinations are also assigned confidence factors in order to
eliminate certain combinations that result in bad translations. ( pdf
copy )
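A toy sketch of a statistically derived confidence factor is shown below; the smoothing formula, counts, and templates are invented and are not the thesis's actual assignment method. The idea is simply that a template's confidence reflects how often it contributed to correct translations in the training data, and that templates can then be ranked by this value.

```python
# Assigning and ranking by a smoothed confidence factor (illustrative sketch).
def confidence_factor(correct_uses, total_uses, smoothing=1.0):
    """Laplace-smoothed ratio of correct uses to all uses of a template."""
    return (correct_uses + smoothing) / (total_uses + 2 * smoothing)

# hypothetical (correct_uses, total_uses) statistics for two learned templates
templates = {
    "X+ACC okudum <-> I read X": (18, 20),
    "X gitti <-> X went": (5, 9),
}
ranked = sorted(templates, key=lambda t: confidence_factor(*templates[t]), reverse=True)
print(ranked)   # templates ordered by confidence, highest first
```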
Selman Murat Temizsoy
Design and Implementation of a System for Mapping Text Meaning
Representations to F-Structures of Turkish Sentences
M.S.
Thesis, August 1997
ABSTRACT: The interlingua approach to Machine Translation (MT) aims to achieve
the translation task in
two independent steps. First, the meanings of source language sentences are
represented in a language-independent artificial language. Then, sentences of
the target language are generated from those meaning representations.
The generation task in this approach is performed in three major steps, among which
the second step creates the syntactic structure of a sentence from its meaning
representation and selects the words to be used in that sentence. This thesis
focuses on the design and the implementation of a prototype system that
performs this second task. The meaning representation used in this work
utilizes a hierarchical world representation, ontology,
to denote events and entities, and embeds semantic and pragmatic issues with
special frames. The developed system is language-independent and it takes
information about the target language from three knowledge resources: lexicon
(word knowledge), map-rules (the relation
between the meaning representation and the syntactic structure), and target
language's syntactic structure representation. It performs two major tasks in
processing the meaning representation: lexical selection and mapping the two
representations of a sentence. The implemented system is tested on Turkish
using small-sized knowledge resources developed for Turkish. The output of the
system can be fed as input to a tactical generator, which is developed for
Turkish, to produce the final Turkish sentences. ( pdf
copy )
Dilek Zeynep Hakkani
Design and Implementation of a Tactical Generator for Turkish, A Free
Constituent Order Language
M.S.
Thesis, July 1996 (co-supervised by Kemal Oflazer)
ABSTRACT: This thesis
describes a tactical generator for Turkish, a free constituent order language,
in which the order of the constituents may change according to the information
structure of the sentences to be generated. In the absence of any information
regarding the information structure of a sentence (i.e., topic, focus,
background, etc.), the constituents of the sentence obey a default order, but
the order is almost freely changeable, depending on the constraints of the text
flow or discourse. We have used a recursively structured finite state machine
for handling the changes in constituent order, implemented as a right-linear
grammar backbone. Our implementation environment is the GenKit
system, developed at Carnegie Mellon University--Center for Machine
Translation. Morphological realization has been implemented using an external
morphological analysis/generation component which performs concrete morpheme
selection and handles morphographemic processes. ( pdf
copy )
Turgay Korkmaz
Turkish Text Generation with Systemic-Functional Grammar
M.S.
Thesis, June 1996
ABSTRACT: Natural Language Generation
(NLG) is roughly decomposed into two stages: text planning, and text
generation. In the text planning stage, the semantic description of the text is
produced from the conceptual inputs. Then, the text generation system
transforms this semantic description into an actual text. This thesis focuses
on the design and implementation of a Turkish text generation system rather
than text planning. To develop a text generator, we need a linguistic theory
that describes the resources of the desired natural language, and also a
software tool that represents and realizes these linguistic resources in a
computational environment. In this thesis, in order to meet these requirements,
we have used a functional linguistic theory called Systemic-Functional Grammar
(SFG), and the FUF text generation system as a
software tool. The ultimate text generation system takes the semantic
description of the text sentence by sentence, and then produces a morphological
description for each lexical constituent of the sentence. The morphological
descriptions are realized as surface word forms by a Turkish morphological
generator. Because of our concentration on text generation, we have not
considered the details of text planning. Hence, we assume that the semantic
description of the text is
produced and lexicalized by an application (currently given by hand). (pdf
copy)