The paper focuses on the modelling of multiword expressions (MWE) in Bulgarian-English parallel news corpora (SETimes; CSLI dataset and PennTreebank dataset). Observations were made on alignments in which at least one multiword expression was used per language. The multiword expressions were classified with respect to the PARSEME lexicon-based (WG1) and treebank-based (WG4) classifications. The non-MWE counterparts of MWEs are also considered. Our approach is data-driven because the data of this study was retrieved from parallel corpora and not from bilingual dictionaries. The survey shows that the predominant translation relation between Bulgarian and English is MWE-to-word, and that this relation does not exclude other translation options. To formalize our observations, a catenae-based modelling of the parallel pairs is proposed.
The paper focuses on the traditional understanding of the syntactic relations that are relevant for Bulgarian. These are: agreement, government, prepositional linking and apposition. Although similar observations exist in the Bulgarian linguistic literature, they have not been discussed consistently. This survey suggests a structured typology of the syntactic relations. It shows more systematically that the surface syntactic relation might differ from the underlying one, and that two syntactic relations can hold at once between two lexical elements that form a constituent.
In this paper we report on an ongoing project, LT4eL (Language Technology for eLearning), which aims at improving the effectiveness of retrieval and the accessibility of learning objects within a learning management system. We elaborate on the process of building the domain ontology and present the multilingual support offered to the application.
The paper outlines a hybrid architecture for a partial parser based on regular grammars over XML documents. The parser is used to support the annotation process in the BulTreeBank project. Thus the parser annotates only the 'sure' cases. To maximize the number of analyzed phrases, the parser applies a set of grammars in a dynamic fashion. Each grammar determines not only the constituent structure (plus some syntactic dependencies internal to the structure), but also a description of the local and global context of the recognized phrase. The grammars available to the parser are arranged in a network. The order in which the grammars are applied depends on the initial ordering in the network and the descriptions associated with the grammars. Thus the traversal is not deterministic. Additionally, the application of the grammars can be interleaved with the application of other XML tools such as remove, insert and transform operations. This architecture provides a flexible means for g...
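The idea of applying regular grammars to the children of an XML element can be illustrated with a small sketch. It is not the BulTreeBank parser: the rule set `RULES`, the tag names and the helpers `apply_rule` and `run_cascade` are hypothetical, and the cascade here runs in a fixed order, whereas the abstract describes a network of grammars whose traversal depends on context descriptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: (name of the new element, regex over child tag names).
# Each tag in the pattern is followed by a single space.
RULES = [
    ("NP", r"(A )*N "),   # optional adjectives followed by a noun
    ("PP", r"P NP "),     # preposition followed by a noun phrase
]

def apply_rule(parent, name, pattern):
    """Wrap each child sequence whose tags match `pattern` in a <name> element."""
    children = list(parent)
    tags = "".join(child.tag + " " for child in children)
    # The lookbehind forces a match to start at a tag boundary.
    matches = list(re.finditer(r"(?<![^ ])(?:%s)" % pattern, tags))
    # Go right to left so that indices of earlier matches stay valid.
    for m in reversed(matches):
        start = tags.count(" ", 0, m.start())   # index of first matched child
        length = m.group().count(" ")           # number of matched children
        wrapper = ET.Element(name)
        for child in children[start:start + length]:
            parent.remove(child)
            wrapper.append(child)
        parent.insert(start, wrapper)
    return bool(matches)

def run_cascade(sentence):
    """Apply every rule once, in list order (a simplification)."""
    for name, pattern in RULES:
        apply_rule(sentence, name, pattern)
    return sentence

s = ET.fromstring("<s><P>in</P><A>old</A><N>town</N><V>sleeps</V></s>")
print(ET.tostring(run_cascade(s), encoding="unicode"))
```

In this toy run the verb is left unwrapped because no rule covers it, which mirrors the "annotate only the sure cases" policy in a very reduced form.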
The paper discusses shallow semantic annotation of Bulgarian treebank. Our goal is to construct the next layer of linguistic interpretation over the morphological and syntactic layers that have already been encoded in the treebank. The annotation is called shallow because it encodes only the senses for the non-functional words and the relations between the semantic indices connected to them. We
One of the goals of the “Language Technology for LifeLong Learning” project is the creation of an appropriate methodology to support both formal and informal learning. Services are being developed that are based on the interaction between a formal representation of (domain) knowledge in the form of an ontology created by experts and a social component which complements it, that is tags and social networks. It is expected that this combination will improve learner interaction, knowledge discovery as well as knowledge co- ...
The paper describes an approach for semantic annotation of multimedia objects stored in a Digital Library implemented as a Web Service. The Library has its own fixed annotation schema and provides a set of functions accessible as Web Service operations. The main objective of semantic annotations (supported by ontologies) is to extend both the Library functionality and the scope of the knowledge in it.
In this paper we outline the joint exploitation of two nominal grammars, a named-entity grammar and a chunk grammar, in the process of building a treebank. We emphasize their contribution towards a unified and effective NP shallow parser. Taking into account their specific underlying principles, we discuss their points of interrelation and point out related problems.
This paper addresses the problem of efficient resource compilation for less-processed languages. It presents a strategy for the creation of a morpho-syntactically tagged corpus for such languages. Because human languages are not morphologically homogeneous, we mainly focus on inflecting ones. With certain modifications, the model can be applied to the other types as well. The strategy is described within a specific implementation environment, the CLaRK System. First, the general architecture of the software is described. Then, the usual steps towards the creation of the language resource are outlined. After that, the concrete implementation properties of the processing steps within CLaRK are discussed: text archive compilation, tokenization, frequency word list creation, morphological lexicon creation, morphological analyzer, semi-automatic disambiguation.
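Three of the steps listed above (tokenization, frequency word list creation, and picking out forms that need disambiguation) can be sketched in a few lines. This is a generic illustration, not the CLaRK System: the functions `tokenize`, `frequency_list` and `ambiguous_forms`, the toy lexicon and its tag names are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and return its word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(texts):
    """Count how often each word form occurs across a text archive."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def ambiguous_forms(counts, lexicon):
    """Word forms with more than one candidate tag, most frequent first."""
    return [w for w, _ in counts.most_common() if len(lexicon.get(w, ())) > 1]

# Toy archive and a hypothetical morphological lexicon (form -> possible tags).
archive = ["The saw cuts wood.", "I saw the saw."]
lexicon = {"saw": {"NOUN", "VERB"}, "cuts": {"NOUN", "VERB"}, "the": {"DET"}}

counts = frequency_list(archive)
print(counts.most_common(2))
print(ambiguous_forms(counts, lexicon))
```

Sorting ambiguous forms by frequency is one plausible way to prioritize manual work in a semi-automatic setting, since resolving the most frequent forms first covers the most tokens.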
This paper discusses in detail the design and implementation phases during the creation of the Bulgarian HPSG-based treebank (BulTreeBank). First, the interconnection of the HPSG language model, the linguistic parameters of the annotation scheme and the underlying ...
Reliable automatic semantic annotation systems do not exist for many languages. Their creation depends in many respects on construction of gold standard corpora. In this paper we present a system for supporting the semi-automatic construction of such corpora. The ...
@Book{AEPC:2011, editor = {Kiril Simov and Petya Osenova and Jörg Tiedemann and Radovan Garabik}, title = {Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora}, month = {September}, year = {2011}, address = {Hissar, Bulgaria}, url ...
The paper presents the term head from various points of view. On the one hand, the perspectives on the head have been discussed in the context of different theoretical approaches. On the other hand, the content of the term has been presented with a view to the criteria for its detection and also to the complexity of the language data.
The paper presents the strategies and conversion principles for mapping BulTreeBank into the Universal Dependencies annotation scheme. The mappings are discussed from a linguistic and a technical point of view. The mapping from the original resource to the new one has been done on the morphological and syntactic levels. The first release of the treebank was issued in May 2015. It contains 125 000 tokens, which cover roughly half of the corpus data.
The notion of catena was introduced originally to represent the syntactic structure of multiword expressions with idiosyncratic semantics and non-constituent structure. Later on, several other phenomena (such as ellipsis, verbal complexes, etc.) were formalized as catenae. This naturally led to the suggestion that a catena can be considered a basic unit of syntax. In this paper we present a formalization of catenae and the main operations over them for modelling the combinatorial potential of units in dependency grammar.
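As a minimal illustration, the sketch below tests whether a set of words forms a catena, using the definition common in the dependency grammar literature: a set of nodes that is connected through head-dependent links in the dependency tree. The abstract does not give the paper's own formalization, so the function `is_catena`, the `head` mapping and the example sentence are assumptions for this example only.

```python
def is_catena(nodes, head):
    """Return True if `nodes` is connected in the dependency tree.

    `head` maps each word index to the index of its head (None for the root).
    In a tree, a non-empty node set is connected exactly when one single
    member has its head outside the set.
    """
    nodes = set(nodes)
    if not nodes:
        return False
    tops = [n for n in nodes if head.get(n) not in nodes]
    return len(tops) == 1

# "She spilled the beans": 1=She, 2=spilled, 3=the, 4=beans
head = {1: 2, 2: None, 3: 4, 4: 2}

print(is_catena({2, 4}, head))     # "spilled ... beans"
print(is_catena({2, 3, 4}, head))  # "spilled the beans"
print(is_catena({2, 3}, head))     # "spilled the" skips the link through "beans"
```

Note that "spilled ... beans" passes the test even though it is not a constituent, which is the property that makes catenae useful for multiword expressions.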
We investigate language contact effects between Bulgarian dialects on the one hand, and the languages of the countries bordering Bulgaria on the other. The Bulgarian data comes from Stojkov's Bulgarian Dialect Atlases. We investigate three techniques to detect contact effects in pronunciation, the phone frequency method and the feature frequency method, both of which are insensitive to the order of phonological segments within words, and also Levenshtein distance, a word-based method which is order-sensitive.
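The contrast between order-sensitive and order-insensitive measures can be shown on toy strings. The sketch below implements standard Levenshtein distance and a deliberately simplified frequency comparison; the function `phone_frequency_distance` is a hypothetical stand-in, and the study's actual phone frequency method may be defined differently (for example, over relative frequencies in whole word lists).

```python
from collections import Counter

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions (order-sensitive)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def phone_frequency_distance(a, b):
    """Sum of absolute differences in segment counts (order-insensitive)."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[s] - cb[s]) for s in set(ca) | set(cb))

# Two toy transcriptions with the same segments in a different order.
print(levenshtein("abc", "cba"))
print(phone_frequency_distance("abc", "cba"))
```

The two toy strings are indistinguishable for the frequency-based measure but not for Levenshtein distance, which is the difference in order sensitivity that the abstract points to.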
The paper focuses on the sense annotation of BulTreeBank. It discusses three levels of annotation: valency frames, lexical senses and DBPedia URIs. The lexical sense annotation is considered in more detail and in relation to the other two processes. Special attention is paid to the quality validation with respect to two aspects: inter-annotator agreement and cross-resource control.
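The abstract does not say which agreement measure was used. As one common option, the sketch below computes Cohen's kappa for two annotators who assign one sense label to each item; the function name `cohens_kappa` and the sense labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["sense1", "sense1", "sense2", "sense2"]
annotator_2 = ["sense1", "sense1", "sense2", "sense1"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.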
The paper discusses the syntactic relations within Bulgarian complex words with a verbal root as head and a nominal or adverbial root as dependant. The parts of speech considered are verbs and verbal nouns. The corpus-based survey shows that verbal nouns allow word-internal arguments more often than verbs do. The data presents both cases: when any word-external arguments are blocked, and when such arguments are allowed.
The paper presents the semantic modeling of Bulgarian parts of speech within the theory of Head-driven Phrase Structure Grammar (HPSG). The parts of speech are divided into two main groups: referents and events. The referents are indicated by the nouns, while the events are presented through verbs, prepositions, adjectives, numerals and adverbs. Since HPSG is a monostratal linguistic theory, the morphosyntactic and semantic information is presented at one level and in close relation to each other. In such a context, the challenge is that Bulgarian is a language with rich morphology, and the model has to balance between grammar and semantics.
Proposed for the fourth time, the QA at CLEF track has confirmed a still rising interest from the research community, recording a constant increase in both the number of participants and the number of submissions. In 2006, two pilot tasks, WiQA and AVE, were proposed beside the main tasks, representing two promising experiments for the future of QA. Some significant innovations were also introduced in the main task, namely list questions and the requirement that text snippet(s) support the exact answers. Although this had an impact on the workload of the organizers, both in preparing the question sets and especially in evaluating the submitted runs, it had no significant influence on the performance of the systems, which registered a higher best accuracy than in the previous campaign, in both monolingual and bilingual tasks. In this paper the preparation of the test set and the evaluation process are described, together with a detailed presentation of the results for each of the languages. The pilot tasks WiQA and AVE will be presented in dedicated articles.
(with a view to automatic natural language processing)
This report is an extension of a paper published at the RANLP 2001 conference. The paper is an improvement over the work done on POS disambiguation for Bulgarian via Neural Networks (Vlasseva 1999). Our improvements are in several directions: (1) we extended the range of grammatical features predicted by the system to cover almost all paradigmatic members of Bulgarian words, (2) we changed the encoding schemata for grammatical features in order to minimize the computation and to use the context layer of the network more extensively, (3) we changed the evaluation of the network output in order to minimize the side effects from evaluating cases that are not relevant in a particular instance of ambiguity. Besides the improvements to the use of neural networks, we improved the choice of the training corpus and added a rule-based preprocessing component in order to disambiguate the cases for which there are rules ensuring 100% correct results.