Mining longitudinal corpus of spontaneous speech: a graph-based approach to child phonetic variation.
Andrea Briglia; Massimo Mucciardi
2020-01-01
Abstract
The “constrained statistical learning framework” was proposed by J. Saffran to reshape the long-standing debate on language acquisition (Nature vs Nurture). By defining the concept of “statistically biased learning mechanisms”, she posits a new way to account for the “linguistic genius” of babies: her ground-breaking experimental designs, which compared natural and artificial languages, showed how babies are able to identify sound sequences and statistical regularities. According to her, “it is possible that a combination of inherent constraints on the type of patterns acquired by learners, and the use of output from one level of learning as input to the next, may help to explain why something so complex is mastered readily by the human mind”. Following this route, the goal of our research is to test whether and how “any variation doesn’t randomly vary in any other but it rather should follow an underlying pattern, as every variation has an order in itself” (Sauvage, 2015).
Data. Colaje-Ortolang is an open-access French database that is part of CHILDES: seven children were recorded in a natural setting for one hour every month, from their first months of life until the age of seven. Data are available in three different formats (IPA, orthographic and CHAT), each aligned to the corresponding video, allowing researchers to see the source and, if need be, to reinterpret it on their own. The main coding structure consists of a fundamental division between “pho” (what the infant says) and “mod” (what the infant should have said according to the adult’s standard phonetic/phonological norm): we define as a “variation” every occurrence in which “pho” differs from “mod”.
Methodology. First, we used CHAID (Chi-squared Automatic Interaction Detection, Kass 1980) to gain a general insight into how the phonetic variation rate changes over time and which kinds of phonetic units are correctly articulated and which are not. The main limitation of this decision tree technique is that it does not take into account morphological differences between phonemes (e.g., a bilabial is considered equal to an occlusive-liquid). In a second step, to overcome this issue, we wrote Python scripts that track and quantify target phonemes from a predetermined morphology-based list of phonetic units: each infant’s longitudinal series of transcripts was thus subdivided according to this structure, allowing us to easily compare its validity against cross-linguistic results from a selected review article (McLeod et al. 2018). To test whether and how the paths of phonetic unit variation over time follow an underlying pattern of co-probability of occurrence between phonemes, we turned our results into a “Multistream graph”: this gives an overview of the whole acquisition process over time and, at the same time, makes it possible to focus on details in order to see when, where and how a phonetic variation occurred and, subsequently, the path by which it evolved into the phonetic norm.
Results. We found strong variability between infants: learning rates are very different, and exceptions such as “regression” suggest that acquisition is non-linear in nature. We propose to use Clements’ “Theory of traits”, in particular the concepts of “feature hierarchy”, “markedness avoidance” and “feature economy”, to give a structural account of the data mining results. A synoptic overview of the seven infants’ learning paths through “Multistream graph” will be provided, and a discussion on the generalizability of our results will be welcomed.
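As an illustration of the second methodological step, the following is a minimal Python sketch of how a variation rate per class of target phonemes could be computed from “pho”/“mod” pairs. It is not the authors’ script: the `TARGETS` list, the function names and the record layout `(age_in_months, pho, mod)` are hypothetical, the alignment relies on the standard library’s `difflib.SequenceMatcher`, and it assumes one Unicode character per phoneme, which real IPA transcripts do not always satisfy.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical target list grouped by articulatory class (illustrative only).
TARGETS = {
    "bilabial": {"p", "b", "m"},
    "liquid": {"l", "ʁ"},
}


def variation_counts(pho, mod, targets=TARGETS):
    """Return {class: [varied, total]} for one utterance.

    A target phoneme of `mod` counts as realized when the alignment places it
    in an 'equal' block; otherwise it counts as a variation.
    Assumption: one Unicode character per phoneme (a simplification).
    """
    counts = defaultdict(lambda: [0, 0])
    matcher = SequenceMatcher(None, mod, pho, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # For 'insert' blocks the slice of `mod` is empty, so nothing is counted.
        for ch in mod[i1:i2]:
            for name, members in targets.items():
                if ch in members:
                    counts[name][1] += 1
                    if tag != "equal":
                        counts[name][0] += 1
    return counts


def variation_rate_by_age(records, targets=TARGETS):
    """Aggregate (age_in_months, pho, mod) records into {(age, class): rate}."""
    totals = defaultdict(lambda: [0, 0])
    for age, pho, mod in records:
        for name, (varied, total) in variation_counts(pho, mod, targets).items():
            totals[(age, name)][0] += varied
            totals[(age, name)][1] += total
    return {key: varied / total for key, (varied, total) in totals.items() if total}
```

Calling `variation_rate_by_age` on a child’s monthly records yields one rate per age and phoneme class, which is the kind of longitudinal series that could then be compared across children or plotted as streams over time.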