Structuring Arabic lexical and morphological resources using TEI: theory and practice

By Angelo Mario Del Grosso, Ouafae Nahli


An Arabic word can be described in terms of its lexical and morphological information.
Lexical information, conveyed by the root, comprises both semantic meaning and syntactic properties (e.g. part of speech), whereas morphological information, encoded by patterns, serves to group words that share similar syntactic, inflectional and semantic behaviour.

Lexical analysis and morphological analysis have been described separately since the earliest studies of the Arabic language. Although several scholarly works present Arabic lexicon models that encode semantic meaning, a systematic description of word patterns is still lacking.
In this work, we have designed an exhaustive resource consisting of two levels: lexical and morphological. The lexical level collects information extracted from the dictionary al=qāmūs al=muḥīṭ. The morphological level provides a formalization of patterns, which makes it possible to enrich word descriptions with additional semantic, morphosyntactic and inflectional information.

To build our digital resource while taking into account the primary source, lexical requirements, and reusability, we followed the guidelines of the Text Encoding Initiative (TEI). We adopted the TEI module for encoding dictionaries and lexicons to formally represent the medieval dictionary al=qāmūs al=muḥīṭ. Given the complexity of describing the morphological information carried by the patterns, we also used the TEI module for encoding feature structures.

With the resulting model, we can build an exhaustive resource composed of two components: the lexical block and the morphological block. These two components are distinct but complementary resources, in which lexical data is connected to morphological information.
In addition, the morphological resource can be used as a stand-alone tool, allowing morphological analyzers to capture aspects of meaning that current systems miss.
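
As a rough illustration of the two-level design, the following Python sketch builds one TEI dictionary entry (lexical block) and one TEI feature structure (morphological block) and links them by identifier. It is not the authors' actual encoding. The element names (entry, form, orth, gramGrp, pos, gram, sense, def, fs, f, symbol) belong to the TEI dictionary and feature-structure modules, but the feature names, the identifier pat_faail, the use of corresp as the link, and the example word kātib ("writer", root k-t-b, pattern fāʿil) are assumptions chosen only for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_pattern(pattern_id, features):
    """Morphological block: one pattern as a TEI feature structure (<fs>).

    The feature names passed in are hypothetical, not the authors' schema.
    """
    fs = ET.Element(tei("fs"), {f"{{{XML_NS}}}id": pattern_id, "type": "pattern"})
    for name, value in features.items():
        f = ET.SubElement(fs, tei("f"), {"name": name})
        ET.SubElement(f, tei("symbol"), {"value": value})
    return fs


def build_entry(lemma, root, pos, gloss, pattern_id):
    """Lexical block: one dictionary <entry> that points at its pattern."""
    entry = ET.Element(tei("entry"))
    lemma_form = ET.SubElement(entry, tei("form"), {"type": "lemma"})
    ET.SubElement(lemma_form, tei("orth")).text = lemma
    root_form = ET.SubElement(entry, tei("form"), {"type": "root"})
    ET.SubElement(root_form, tei("orth")).text = root
    gram_grp = ET.SubElement(entry, tei("gramGrp"))
    ET.SubElement(gram_grp, tei("pos")).text = pos
    # Assumed linking convention: @corresp carries a pointer to the pattern's xml:id.
    ET.SubElement(gram_grp, tei("gram"),
                  {"type": "pattern", "corresp": f"#{pattern_id}"})
    sense = ET.SubElement(entry, tei("sense"))
    ET.SubElement(sense, tei("def")).text = gloss
    return entry


def pattern_features(entry, patterns):
    """Follow the entry's pattern pointer and return the pattern's features."""
    gram = entry.find(f"{tei('gramGrp')}/{tei('gram')}[@type='pattern']")
    target = gram.get("corresp").lstrip("#")
    fs = patterns[target]
    return {f.get("name"): f.find(tei("symbol")).get("value")
            for f in fs.findall(tei("f"))}


if __name__ == "__main__":
    # Illustrative data only; not taken from the authors' resource.
    pattern = build_pattern("pat_faail", {
        "pos": "noun",
        "derivation": "active_participle",
        "verbForm": "I",
    })
    entry = build_entry("kātib", "k-t-b", "noun", "writer", "pat_faail")
    patterns = {"pat_faail": pattern}
    print(ET.tostring(entry).decode("ascii"))
    print(ET.tostring(pattern).decode("ascii"))
    print(pattern_features(entry, patterns))
```

Keeping the pattern in its own structure and referencing it from the entry mirrors the claim that the two blocks are distinct but connected: the dictionary of patterns can be consulted on its own, while each entry inherits the pattern's features through the pointer.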

International Journal of Information Science and Technology (iJIST) – ISSN: 2550-5114