Scope of Work: We are seeking a qualified supplier to provide a comprehensive data corpus of annotated conversational Cantonese speech. The project requires at least 260 hours of high-quality Hong Kong-style Cantonese recordings with over 50,000 annotated conversational sentences.
Data Specifications:
- Cantonese Accent: Hong Kong-style, conversational dialogs, oral traditional texts.
- Annotated Sentences: Over 50,000 annotated conversational sentences.
- Recording Duration: At least 260 hours of matching annotated recordings. A single file should not exceed 2 hours; a speaker's total accumulated duration should not exceed 3 hours.
- Silence: Continuous silent periods longer than 30 seconds should make up less than 20% of the total recording duration. Long silent periods may be trimmed.
- Audio Quality: Sampling rate ≥ 16,000 Hz, bit depth ≥ 16 bits, and average SNR higher than 13 dB.
- Spontaneous Speech: Local Hong Kong accent, with no more than 5 people in a conversation.
- Mixed Languages: Each conversation should include English words; otherwise, pure Cantonese content should not exceed 80% of the corpus.
- Topic Categorization: 80% or more in news, tech, education, sports, vehicle, nonprofit activity, finance, medical, and healthcare; 20% or less in other topics.
- Speaker Distribution: Balanced gender distribution, ages 18-60.
- Speaker Diarization: Each conversation should be annotated with speaker information (e.g., Speaker A, B, C).
- Accuracy: Annotated texts must match the audio speech with a word-level accuracy of at least 95%.
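The duration, speaker, and audio-format limits above lend themselves to an automated pre-delivery check. The following is a minimal sketch only, assuming the supplier keeps a per-file manifest; the Recording class, its field names, and the check_corpus function are hypothetical and not part of this specification.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Thresholds copied from the Data Specifications list.
MIN_TOTAL_HOURS = 260
MAX_FILE_HOURS = 2
MAX_SPEAKER_HOURS = 3
MAX_SPEAKERS_PER_CONVERSATION = 5
MIN_SAMPLE_RATE_HZ = 16000
MIN_BIT_DEPTH = 16


@dataclass
class Recording:
    # Hypothetical manifest entry; field names are illustrative only.
    path: str
    duration_s: float
    sample_rate_hz: int
    bit_depth: int
    # Anonymised speaker label -> seconds attributed to that speaker in this file.
    speaker_seconds: dict = field(default_factory=dict)


def check_corpus(recordings):
    """Return a list of human-readable violations (empty if none are found)."""
    problems = []
    total_s = 0.0
    per_speaker_s = defaultdict(float)

    for rec in recordings:
        total_s += rec.duration_s
        if rec.duration_s > MAX_FILE_HOURS * 3600:
            problems.append(f"{rec.path}: file exceeds {MAX_FILE_HOURS} h")
        if rec.sample_rate_hz < MIN_SAMPLE_RATE_HZ:
            problems.append(f"{rec.path}: sampling rate below {MIN_SAMPLE_RATE_HZ} Hz")
        if rec.bit_depth < MIN_BIT_DEPTH:
            problems.append(f"{rec.path}: bit depth below {MIN_BIT_DEPTH}")
        if len(rec.speaker_seconds) > MAX_SPEAKERS_PER_CONVERSATION:
            problems.append(f"{rec.path}: more than {MAX_SPEAKERS_PER_CONVERSATION} speakers")
        for speaker, seconds in rec.speaker_seconds.items():
            per_speaker_s[speaker] += seconds

    # Per-speaker totals are accumulated across all files before checking the cap.
    for speaker, seconds in sorted(per_speaker_s.items()):
        if seconds > MAX_SPEAKER_HOURS * 3600:
            problems.append(f"speaker {speaker}: exceeds {MAX_SPEAKER_HOURS} h in total")

    if total_s < MIN_TOTAL_HOURS * 3600:
        problems.append(f"corpus: only {total_s / 3600:.1f} h, need {MIN_TOTAL_HOURS} h")
    return problems
```

Calling check_corpus on the full manifest before submission would surface files or speakers that exceed the stated limits. SNR, silence ratio, and topic distribution are not covered by this sketch and would need separate checks.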
Additional Requirements:
- Each timestamped segment should be shorter than 25 seconds.
- No personal data should be disclosed, and voiceprint data should not be marked.
- Audio format must be PCM, WAV, FLAC, or M4A.
- A 3-minute sample of your annotated data is required for consideration.
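The segment-length and audio-format requirements above can also be checked mechanically. This is a rough sketch under assumptions: the Segment class and both function names are hypothetical, and the file extension is used only as a proxy for the actual audio format.

```python
from dataclasses import dataclass
from pathlib import Path

MAX_SEGMENT_S = 25.0  # each segment should be shorter than this
ALLOWED_EXTENSIONS = {".pcm", ".wav", ".flac", ".m4a"}


@dataclass
class Segment:
    # Hypothetical annotation record; field names are illustrative only.
    start_s: float
    end_s: float
    speaker: str  # anonymised label such as "A", "B", "C"
    text: str


def overlong_segments(segments):
    """Return the segments whose duration is 25 seconds or more."""
    return [s for s in segments if s.end_s - s.start_s >= MAX_SEGMENT_S]


def has_allowed_format(path):
    """Check the file extension only; the container itself is not inspected."""
    return Path(path).suffix.lower() in ALLOWED_EXTENSIONS
```

A supplier could run overlong_segments over each annotation file and split any segment it returns before delivery.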
Acceptance Criteria:
- The recorded data must match the annotated texts.
- Data corpus must be delivered within six weeks.
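This document does not define how the match between audio and annotation is scored. One common convention, assumed here purely for illustration, is accuracy = 1 - (substitutions + deletions + insertions) / (number of reference tokens), computed with an edit distance over token lists. The function names below are hypothetical, and how Cantonese text is tokenised (characters or segmented words) is left to the caller.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(ref_tokens, hyp_tokens):
    """Accuracy as 1 minus the word error rate; can be negative for very poor matches."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 1.0 - edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def meets_threshold(ref_tokens, hyp_tokens, threshold=0.95):
    # 0.95 mirrors the 95% word-level accuracy stated in the specifications.
    return word_accuracy(ref_tokens, hyp_tokens) >= threshold
```

Here ref_tokens would come from an independent verification transcript and hyp_tokens from the delivered annotation, both already split into tokens.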
Submission Instructions:
Interested suppliers can apply by submitting their information and sample data through the form below.
https://forms.gle/2BCqqov5dhWcuGss8