ReadNet
A Journey in Language
Data Collection and Analysis
ReadNet is a data collection exercise that has been successfully running for three years, focusing on collecting and analyzing language data to improve literacy outcomes among children. Initially implemented in Hindi (Uttar Pradesh and Rajasthan) and Marathi (Maharashtra), ReadNet aims to expand to other Indian and African languages, leveraging the insights gained from its initial phases.
Children
Audio Hours
Languages
States
The Genesis
The ReadNet dataset comprises approximately 2,500 hours of audio data gathered from 180,000 children across three states in two languages. This dataset fills a gap in children’s audio datasets that can be used to train speech recognition systems with an aim to automate reading assessment. The insights derived from this data can also help in understanding common learning difficulties, learner reading patterns, and regional variations in pronunciation, enabling educators to tailor their teaching methods more effectively.
How ReadNet Works?
ReadNet’s methodology revolves around data collection, analysis, and interpretation. The program gathered data from 180,000 children across three states: Uttar Pradesh, Rajasthan, and Maharashtra. This data, totaling approximately 2,500 hours, includes recordings of children reading aloud, which were then transcribed and analyzed to identify patterns of reading mistakes.
Data Collection
- Regions: Uttar Pradesh, Rajasthan, Maharashtra
- Languages: Hindi, Marathi
- Participants: 180,000 children
- – Uttar Pradesh: 55,187 children
- – Rajasthan: 49,011 children
- – Maharashtra: 79,000 audio data samples collected in Marathi
Data Analysis
- Focus Areas:
- Common reading mistakes
- Confusion between letters and words
- Accent variations
- Word association patterns
- Tools Used: Advanced speech recognition and transcription technologies
Our Impact
Where ReadNet Made a Difference
ReadNet has provided critical insights into the reading abilities and challenges faced by children in the targeted regions. The analysis revealed significant insights like
Letter Confusion
Children often confuse letters with similar sounds, impacting their reading accuracy.
ध is often confused with घ or छ,
व is often confused with ब and vice-versa
Accent Variation
Regional accents contribute to reading errors, particularly in non-standard pronunciations.
We found that a total of 4,174 students have used a different accent at least once.
Word Association
Early readers often rely on familiar word associations, which can be both a help and a hindrance in their literacy journey.
We Found that 9078 students have associated the letter with words at least once.
The ReadNet Journey
2022
Pilot
ASER-trained volunteers used tablets and earphones with microphones to collect high-quality audio data, with regular checks on sampled recordings to ensure quality.
2023
Data Collection & Annotation
Annotation involved a portal and trained annotators for detailed transcription and analysis, with 50% of data annotated twice. The Pratham-HSBC collaboration saw 2,900 volunteers annotate 33,000 samples in 475 hours.
2024
Data Analysis & Reporting
ReadNet’s diverse dataset is pivotal for analyzing children’s speech patterns, training custom speech recognition models, and enhancing automated reading assessment tools.
Latest Research Relevant to ReadNet
ReadNet: Preventing Reading Failure with Speech Recognition-Powered Assessment by MIT
ROAR: The Rapid Online Assessment of Reading by the Stanford Reading & Dyslexia Research Program