ODISAE Optimizing Digital Interaction with a Social and Automated Environment
This page describes the main contributions of the LINA-TALN team to the ODISAE project.
The official web page of the project is www.odisae.com.
The present work was supported by the Unique Interministerial Fund (FUI) 17.
The ODISAE project aims at developing a semantic analyser of written online conversations across several modalities (i.e. chat, forum, email) in the context of CRM (Customer Relationship Management). The targeted capabilities are: multi-modal text information retrieval (e.g. finding the solution to a problem in a modality different from the one in which the request was formulated), automated FAQ and documentation management (e.g. automatic detection of the absence of a suitable solution to a recurring request), automated assistance generation (e.g. helping users to formulate problems, evaluating the exhaustiveness of answers), and conversation supervision (e.g. detecting attrition, irritation).
We report here the main contributions in which LINA took part.
In collaboration with Eptica and Kwaga, we defined a model to describe the structure and the meta-data of the conversations.
The model is based on the observation of the collected corpus (data-driven approach). Three sources were considered: forum, email and IRC.
The model is generic and aims at describing the conversations independently from the communication channel.
The model was instantiated in an exchange format (a W3C XML Schema was defined for this purpose).
Interactions were described in terms of topics, opinion polarity and dialogue acts carried by the messages. The work was carried out in collaboration with Eptica and Kwaga.
The description results in an Apache UIMA type system definition.
The corpus is made of online written conversations from the same domain, coming from three sources: ubuntu-fr-forum, ubuntu-fr-irc and ubuntu-fr-email.
Ubuntu-fr-forum and ubuntu-fr-email were collected from July 2014 to June 2015, and ubuntu-fr-irc from January 2015 to June 2015.
Conversations are stored in the exchange format proposed by the ODISAE project, with one file per conversation.
The content of the IRC room was split into conversations by using heuristics on the message authors and the time delay between messages.
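The exact heuristics used in the project are not detailed here. As an illustration only, the following minimal sketch shows one way such a split could work: a new conversation starts when the silence since the previous message exceeds a threshold, with a more tolerant threshold for authors who already took part in the current conversation. The `Message` class, the function name and both threshold values are assumptions made for this example, not the project's actual settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    # Hypothetical minimal message record; the real exchange format is richer.
    author: str
    timestamp: datetime
    text: str


def split_into_conversations(messages,
                             max_gap=timedelta(minutes=10),
                             max_gap_known=timedelta(minutes=30)):
    """Group messages into conversations using author and time-delay heuristics.

    Assumed rule: a message opens a new conversation when the delay since the
    previous message exceeds `max_gap`, or `max_gap_known` if its author has
    already spoken in the current conversation.
    """
    conversations = []
    current = []
    for msg in sorted(messages, key=lambda m: m.timestamp):
        if current:
            gap = msg.timestamp - current[-1].timestamp
            known = any(m.author == msg.author for m in current)
            limit = max_gap_known if known else max_gap
            if gap > limit:
                conversations.append(current)
                current = []
        current.append(msg)
    if current:
        conversations.append(current)
    return conversations
```

Both thresholds would need tuning against a manually segmented sample of the channel; the values above are placeholders.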
Messages from ubuntu-fr-email were gathered into conversations by using the email In-Reply-To header field as well as subject and period similarity.
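To make the header-based part of this grouping concrete, here is a minimal sketch that follows In-Reply-To links back to a root message and groups messages by that root. It relies only on Python's standard `email` module. The function name is hypothetical, and the fallback on subject and period similarity mentioned above is deliberately left out.

```python
from email import message_from_string


def thread_emails(raw_emails):
    """Group raw email messages into conversations via In-Reply-To.

    Returns a dict mapping the Message-ID of each thread root to the list of
    Message-IDs in that thread, in input order. Messages without a Message-ID
    are skipped in this sketch.
    """
    parent_of = {}  # Message-ID -> Message-ID of the message it replies to
    ids = []
    for raw in raw_emails:
        msg = message_from_string(raw)
        msg_id = msg["Message-ID"]
        if msg_id is None:
            continue
        msg_id = msg_id.strip()
        ids.append(msg_id)
        reply_to = msg["In-Reply-To"]
        if reply_to:
            parent_of[msg_id] = reply_to.strip()

    def root(msg_id):
        # Walk up the reply chain; `seen` guards against malformed cycles.
        seen = set()
        while msg_id in parent_of and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent_of[msg_id]
        return msg_id

    threads = {}
    for msg_id in ids:
        threads.setdefault(root(msg_id), []).append(msg_id)
    return threads
```

Replies whose parent is missing from the collection end up keyed by the absent parent's identifier, which is one situation where a subject or period similarity fallback becomes useful.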
Participant meta-data (username, realname, email) were anonymised. Each participant is identified by a unique identifier.
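The anonymisation procedure itself is not described here. A minimal sketch of the general idea, assuming that participants are keyed on a normalised username and given sequential identifiers (both assumptions, as are the class name and the identifier prefix), could look like this:

```python
import itertools


class Anonymiser:
    """Map participant usernames to stable, opaque identifiers."""

    def __init__(self, prefix="participant-"):
        self._prefix = prefix              # hypothetical identifier prefix
        self._ids = {}                     # normalised username -> identifier
        self._counter = itertools.count(1)

    def identifier_for(self, username):
        # Normalise so that case or stray spaces do not yield two identities.
        key = username.strip().lower()
        if key not in self._ids:
            self._ids[key] = f"{self._prefix}{next(self._counter)}"
        return self._ids[key]
```

The same `Anonymiser` instance has to be shared across all conversations of a source so that a participant keeps one identifier throughout the corpus.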
The three sources do not specify a license for the data they make available.
Consequently, the authors' copyright applies by default to the content of the messages.
In , Ubuntu-fr plays the role of the publisher.
The IRC channel is hosted on the FreeNode servers, but it belongs to Ubuntu-fr, which registered it. There is no archiving procedure for these data except for .
The user delegates the use of his/her messages in the Ubuntu-fr database.
If a user refuses any other use of his/her material, the person or organisation holding this material must remove it from all of its copies.
The content must not be indexed by any Internet search engine.
In the ODISAE project, the components and the main analysis workflow were developed under the Apache UIMA framework.
Models were trained and used through the ClearTK framework which provides a common interface and wrappers for popular machine learning libraries such as SVMlight, LIBSVM, LIBLINEAR, OpenNLP MaxEnt, and Mallet.
We used Apache RUTA for rule-based post-processing, in order to increase precision and fix known issues.
Some of the basic Natural Language Processing steps were performed using the DKPro Core components.
The labels are based on the DIT++ taxonomy of Dialogue Acts.
DIT++ proposes annotating dialogue acts in terms of communicative functions related to some type of semantic content, also called the dimension.
For the present work, two distinct and independent models were built: one for the communicative function, the other for the dimension.
Models were trained using 3000 utterances coming from three distinct modalities of the ubuntu-fr corpus (IRC, email and forum), with 1000 utterances per modality.
Models are licensed under the Apache v2 license.
The team was involved in some communications.