Data mining and first language acquisition: a case study on a French corpus

Andrea Briglia (co-first author, member of the Collaboration Group); Massimo Mucciardi (co-first author, member of the Collaboration Group)

2020-01-01

Abstract

According to Saffran, “during early development, the speed and accuracy with which an organism extracts environmental information can be extremely important for its survival” (1). Following this path, she discovered how powerfully infants compute the transitional probabilities between syllables, which lets them tell sequences that carry a specific meaning from those that do not. COLAJE-Ortolang (2) is an open-access French database made up of 7 longitudinal corpora: every child was video-recorded in a natural setting for one hour every month, over a period spanning from birth to six years of age. The data have been transcribed in CHAT format at the orthographic, phonetic (IPA) and phonological levels, allowing researchers both to test empirical hypotheses on them and to reinterpret them independently. Our goal is to verify whether “any variation doesn’t randomly vary in any other but it rather should follow an underlying pattern, as every variation has an order in itself” (3). First, we mine these corpora using CHAID (Chi-squared Automatic Interaction Detection), a decision-tree technique conceived to overcome the problem of multiple comparisons in a non-parametric way. By testing how the dependent variable (phonetic variation) depends on the independent variables (time and utterance length), the algorithm iteratively forms successively smaller sub-groups. The limitation of this method is that it does not take into account morphological differences between phonetic units (e.g. between a bilabial and an occlusive-liquid). To get around this problem, we evaluate its validity by interpreting its results through the lens of a “consonant acquisition chart” (4) and of some considerations on first language acquisition specific to French (5). We find that a significant number of variations have been properly detected by CHAID: sub-groups with different variation rates reflect what infants are normally able to perceive and articulate at the ages considered.
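To make the CHAID step more concrete, the following is a minimal sketch of a single category-merging pass, not the analysis actually run on the corpora. It assumes a binary outcome (utterances with or without phonetic variation) and an ordered age predictor; the age bands, the counts, the function names and the `alpha_merge` threshold are all hypothetical. Full CHAID additionally applies a Bonferroni adjustment and then splits on the predictor with the smallest adjusted p-value, which this sketch leaves out.

```python
import math

def chi2_2x2(a, b):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for two rows of (varied, not_varied) counts."""
    total = sum(a) + sum(b)
    col_totals = [a[0] + b[0], a[1] + b[1]]
    stat = 0.0
    for row in (a, b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    # With one degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def merge_step(groups, alpha_merge=0.05):
    """One CHAID-style merge pass over ordered categories.

    groups: list of (label, (varied, not_varied)).
    Merges the adjacent pair that differs least (largest p-value),
    provided that p-value exceeds alpha_merge (hypothetical threshold).
    """
    best = None
    for i in range(len(groups) - 1):
        _, p = chi2_2x2(groups[i][1], groups[i + 1][1])
        if best is None or p > best[1]:
            best = (i, p)
    if best is None or best[1] <= alpha_merge:
        return groups  # every adjacent pair differs significantly
    i = best[0]
    (label_a, counts_a), (label_b, counts_b) = groups[i], groups[i + 1]
    merged = (f"{label_a}+{label_b}",
              (counts_a[0] + counts_b[0], counts_a[1] + counts_b[1]))
    return groups[:i] + [merged] + groups[i + 2:]

# Hypothetical counts: (utterances with variation, utterances without).
age_groups = [
    ("12-18m", (80, 20)),
    ("18-24m", (78, 22)),
    ("24-36m", (50, 50)),
    ("36-48m", (20, 80)),
]

merged_once = merge_step(age_groups)
merged_twice = merge_step(merged_once)
```

On these made-up counts the first pass merges the two youngest age bands, whose variation rates are nearly identical, and the second pass leaves the remaining three groups untouched because each adjacent pair differs significantly.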
Then, we write Python scripts to analyse every longitudinal corpus according to a hierarchy of phonetic units, and we plot the results in a “multistream graph” (6) that gives a visual overview of the whole development over time. This could give us a way to deduce constraints and/or preferential paths of phonetic unit acquisition, based on a hypothetical recurrent co-probability of unit occurrences both intra-corpus (child-specific) and inter-corpora.
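As an illustration of the kind of table a stream-style graph can be drawn from, the sketch below counts phonetic units per age and converts the counts into shares. It is only a sketch under stated assumptions: the mapping `UNIT_CLASS`, the sample utterances and the function name `class_shares_by_age` are hypothetical and do not reproduce the hierarchy of phonetic units or the scripts used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from IPA consonants to a coarse class.
UNIT_CLASS = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "t": "dental", "d": "dental",
    "l": "liquid", "ʁ": "liquid",
}

# Hypothetical sample: (age in months, phonetic transcription).
utterances = [
    (12, "papa"),
    (12, "mama"),
    (24, "tata"),
    (24, "bal"),
    (24, "pəti"),
]

def class_shares_by_age(utterances, unit_class):
    """Return {age: {class: share}} where the shares at each age sum to 1."""
    counts = defaultdict(Counter)
    for age, ipa in utterances:
        for symbol in ipa:
            cls = unit_class.get(symbol)
            if cls is not None:  # skip vowels and unmapped symbols
                counts[age][cls] += 1
    shares = {}
    for age, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[age] = {cls: n / total for cls, n in counter.items()}
    return shares

shares = class_shares_by_age(utterances, UNIT_CLASS)
```

Each age then contributes one column of shares, and each phonetic class one stream whose thickness follows its share over time. Normalising within each age keeps sessions with different numbers of utterances comparable.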

Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3182321