ReadNet

A Journey in Language Data Collection and Analysis

ReadNet is a data collection initiative that has been running for three years, collecting and analyzing language data to improve literacy outcomes among children. First implemented in Hindi (Uttar Pradesh and Rajasthan) and Marathi (Maharashtra), ReadNet aims to expand to other Indian and African languages, building on the insights gained from its initial phases.

180,000

Children

2,500

Audio Hours

2

Languages

3

States


The Genesis

The ReadNet dataset comprises approximately 2,500 hours of audio data gathered from 180,000 children across three states in two languages. It fills a gap in the children’s audio data available for training speech recognition systems, with the aim of automating reading assessment. The insights derived from this data can also help in understanding common learning difficulties, learner reading patterns, and regional variations in pronunciation, enabling educators to tailor their teaching methods more effectively.

How ReadNet Works

ReadNet’s methodology revolves around data collection, analysis, and interpretation. The program gathered data from 180,000 children across three states: Uttar Pradesh, Rajasthan, and Maharashtra. This data, totaling approximately 2,500 hours, includes recordings of children reading aloud, which were then transcribed and analyzed to identify patterns of reading mistakes.

Data Collection

Regions: Uttar Pradesh, Rajasthan, Maharashtra
Languages: Hindi, Marathi
Participants: 180,000 children
- Uttar Pradesh: 55,187 children
- Rajasthan: 49,011 children
- Maharashtra: 79,000 audio data samples collected in Marathi

Data Analysis

Focus Areas:
- Common reading mistakes
- Confusion between letters and words
- Accent variations
- Word association patterns
Tools Used: Advanced speech recognition and transcription technologies

Our Impact

Where ReadNet Made a Difference

ReadNet has provided critical insights into the reading abilities of children in the targeted regions and the challenges they face. The analysis highlighted three patterns:

Letter Confusion

Children often confuse letters with similar sounds, impacting their reading accuracy.

ध is often confused with घ or छ.
व is often confused with ब, and vice versa.
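
As a rough illustration of how confusions like these could be tallied, here is a minimal sketch. It assumes each annotated sample has already been reduced to a pair of (expected letter, letter the child read). That pair format, the `count_confusions` helper, and the sample data are hypothetical and are not taken from the ReadNet pipeline.

```python
from collections import Counter


def count_confusions(pairs):
    """Tally (expected, read) letter pairs where the child read a different letter."""
    return Counter(
        (expected, read) for expected, read in pairs if expected != read
    )


# Hypothetical sample of (expected letter, letter the child read) pairs.
sample = [
    ("ध", "घ"),
    ("ध", "ध"),  # read correctly, so not counted
    ("व", "ब"),
    ("ब", "व"),
    ("ध", "घ"),
    ("ध", "छ"),
]

confusions = count_confusions(sample)

# List the confused pairs, most frequent first.
for (expected, read), n in confusions.most_common():
    print(f"{expected} read as {read}: {n}")
```

Keeping the pair ordered (expected first, read second) preserves direction, so व read as ब and ब read as व are counted separately.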

Accent Variation

Regional accents contribute to reading errors, particularly in non-standard pronunciations.

We found that 4,174 students used a different accent at least once.

Word Association

Early readers often rely on familiar word associations, which can be both a help and a hindrance in their literacy journey.

We found that 9,078 students associated a letter with a word at least once.

The ReadNet Journey

2022

Pilot

ASER-trained volunteers used tablets and earphones with microphones to collect high-quality audio data, with regular checks on sampled recordings to ensure quality.

2023

Data Collection & Annotation

Annotation involved a portal and trained annotators for detailed transcription and analysis, with 50% of data annotated twice. The Pratham-HSBC collaboration saw 2,900 volunteers annotate 33,000 samples in 475 hours.

2024

Data Analysis & Reporting

ReadNet’s diverse dataset is pivotal for analyzing children’s speech patterns, training custom speech recognition models, and enhancing automated reading assessment tools.

Latest Research Relevant to ReadNet

ReadNet: Preventing Reading Failure with Speech Recognition-Powered Assessment by MIT

https://mitili.mit.edu/research/readnet-preventing-reading-failure-speech-recognition-powered-assessment

ROAR: The Rapid Online Assessment of Reading by the Stanford Reading & Dyslexia Research Program