Text data mining and child phonetic variation: a case study on a French corpus

Andrea Briglia; Massimo Mucciardi
2019-01-01

Abstract

If the human brain is really a statistical model of the world it inhabits, in which a Bayes-optimal generative model detects regularities in the outer world, is it then worth studying language acquisition through statistics? According to Saffran, “during early development, the speed and accuracy with which an organism extracts environmental information can be extremely important for its survival”. Following this path, she showed how powerfully infants compute the transitional probabilities between syllable sequences, distinguishing sequences that have a specific meaning from those that do not. In our research we use some basic statistical tools to look more closely at child phonetic acquisition over a period that spans from birth to age seven. Thanks to a former ANR (Agence nationale de la recherche) project on the French language, we have the opportunity to test hypotheses on a large database made up of longitudinal corpora, each child being recorded for approximately one hour every month in a natural setting. By analysing the transcriptions of these data, which are coded in CHAT (Codes for the Human Analysis of Transcripts), we are developing some general statistical measures to follow the evolution of word counts and frequencies, as well as the mean length of utterance, over time. So far, we have simply confirmed the intrinsically non-linear nature of language acquisition: we know that phonetic variation will decrease year after year and that the child will be able to speak by age seven, but it is still not clear what happens inside these learning stages. Our aim is to see and track “how any variation doesn’t randomly vary in any other but it rather should follow a statistical pattern, as every variation has an order in itself”.
We would therefore like to share with the audience our first attempts to use two different methods, time series and conditional random fields, in order to see whether and how different results emerging from the same corpora suggest possible constraints on, and/or preferential paths of, phonetic unit acquisition, based on a hypothetical recurrent co-probability of unit occurrences both within a corpus (child-specific) and across corpora. A synoptic overview of the corpora, presented as an overlay of graphs, will also be shown. Any variation would thus be considered a temporarily achieved structure at a given time in development that, in turn, structures the next possible variation by working as a constraint, and so on. The result would be a dynamic causal circle that allows us to see how different variations could hypothetically be linked together on the basis of a subsequent articulatory and perceptual co-organization over time.
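As an illustration of two of the quantities mentioned above, the following minimal sketch estimates transitional probabilities between adjacent syllables and a word-based mean length of utterance. It is not the authors' pipeline: the function names, the toy syllable stream and the sample utterances are all hypothetical, and real CHAT transcripts would first need to be parsed and syllabified.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Estimate P(next | current) for each adjacent syllable pair in a stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor, so the
    # probabilities leaving any given syllable sum to 1.
    first_counts = Counter(syllables[:-1])
    return {
        (a, b): n / first_counts[a]
        for (a, b), n in pair_counts.items()
    }


def mean_length_of_utterance(utterances):
    """Mean number of words per utterance (MLU measured in words, not morphemes)."""
    lengths = [len(u.split()) for u in utterances if u.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical toy stream built from two made-up "words", "bida" and "kupa".
stream = ["bi", "da", "ku", "pa", "bi", "da", "bi", "da", "ku", "pa"]
tp = transitional_probabilities(stream)

# Hypothetical utterances standing in for one recording session.
mlu = mean_length_of_utterance(["veux ça", "le chat dort", "non"])
```

In this toy stream the pair inside a made-up word, such as ("bi", "da"), receives a higher estimated probability than a pair that straddles a word boundary, such as ("da", "ku"), which is the kind of contrast the transitional-probability account relies on.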
Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3182165