Scope of Work: We are seeking a qualified supplier to provide a comprehensive data corpus of annotated conversational Cantonese speech. The project requires at least 260 hours of high-quality Hong Kong-style Cantonese recordings with over 50,000 annotated conversational sentences.

Data Specifications:

  1. Cantonese Accent: Hong Kong-style, conversational dialogs, oral traditional texts.
  2. Annotated Sentences: Over 50,000 annotated conversational sentences.
  3. Recording Duration: At least 260 hours of matching annotated recordings. A single file should not exceed 2 hours; a speaker's total accumulated duration should not exceed 3 hours.
  4. Silence: Continuous silent periods longer than 30 seconds should make up less than 20% of a recording. Long silent periods may be cut.
  5. Audio Quality: Sampling rate ≥ 16,000 Hz, bit depth ≥ 16 bits, and average SNR above 13 dB.
  6. Spontaneous Speech: Local Hong Kong accent, with no more than 5 people in a conversation.
  7. Mixed Languages: Each conversation should include English words; alternatively, pure-Cantonese content should not exceed 80%.
  8. Topic Categorization: 80% or more in news, tech, education, sports, vehicle, nonprofit activity, finance, medical, and healthcare; 20% or less in other topics.
  9. Speaker Distribution: Balanced gender distribution, ages 18-60.
  10. Speaker Diarization: Each conversation should be annotated with speaker information (e.g., Speaker A, B, C).
  11. Accuracy: Annotated texts must match the audio speech with a word-level accuracy of at least 95%.
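
As an illustration only, the numeric limits in items 3 and 5 could be pre-checked before submission with a short script along these lines. This is a minimal sketch that assumes uncompressed WAV input read with Python's standard wave module; the function names check_wav and speakers_over_limit and the per-speaker duration mapping are hypothetical and not part of this request. FLAC and m4a files would need a different reader.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000          # item 5: sampling rate
MIN_BIT_DEPTH = 16                   # item 5: bit depth
MAX_FILE_SECONDS = 2 * 60 * 60       # item 3: a single file may not exceed 2 hours
MAX_SPEAKER_SECONDS = 3 * 60 * 60    # item 3: a speaker may not exceed 3 hours in total


def check_wav(path):
    """Return (duration_seconds, problems) for one uncompressed WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8        # sample width is reported in bytes
        duration = wav.getnframes() / rate
    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if bits < MIN_BIT_DEPTH:
        problems.append(f"bit depth {bits} is below {MIN_BIT_DEPTH}")
    if duration > MAX_FILE_SECONDS:
        problems.append(f"duration {duration:.0f} s exceeds {MAX_FILE_SECONDS} s")
    return duration, problems


def speakers_over_limit(seconds_by_speaker):
    """Return the speaker IDs whose accumulated duration exceeds 3 hours.

    seconds_by_speaker is a hypothetical mapping from an anonymous speaker ID
    to a list of that speaker's recorded durations in seconds.
    """
    return sorted(
        speaker
        for speaker, seconds in seconds_by_speaker.items()
        if sum(seconds) > MAX_SPEAKER_SECONDS
    )
```

An empty problem list from check_wav means the file passes these three checks only; the silence and SNR requirements are not covered by this sketch.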

Additional Requirements:

  • Each timestamped segment should be shorter than 25 seconds.
  • No personal data should be disclosed, and voiceprint data should not be marked.
  • Audio format must be PCM, WAV, FLAC, or m4a.
  • A 3-minute sample of your annotated data is required for consideration.
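
For illustration, the 25-second segment limit above could be checked as follows. This is a sketch under stated assumptions: the annotation layout, a list of (start_seconds, end_seconds, speaker_label) tuples, and the function name find_long_segments are hypothetical, since this request does not prescribe an annotation file format.

```python
MAX_SEGMENT_SECONDS = 25.0  # each segment should be shorter than this


def find_long_segments(segments, max_seconds=MAX_SEGMENT_SECONDS):
    """Return the segments whose length is max_seconds or more.

    segments is a list of (start_seconds, end_seconds, speaker_label) tuples;
    this layout is an assumption made for the example.
    """
    return [
        (start, end, speaker)
        for start, end, speaker in segments
        if end - start >= max_seconds
    ]


# Example with anonymous speaker labels, as required by the diarization item.
example = [(0.0, 12.5, "A"), (12.5, 40.0, "B"), (40.0, 52.0, "A")]
too_long = find_long_segments(example)  # the 27.5-second segment by speaker B
```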

Acceptance Criteria:

  • The recorded data must match the annotated texts.
  • Data corpus must be delivered within six weeks.
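
As an illustration of how the match between audio and annotated text might be scored, the sketch below computes word-level accuracy as one minus the word error rate, using a standard edit distance over token lists. The definition of a "word" for mixed Cantonese and English text is not specified in this request, so the example assumes both transcripts are already split into tokens; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists (substitute, insert, delete)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# One substituted token out of four reference tokens.
reference = ["我", "今日", "去", "school"]
hypothesis = ["我", "今日", "返", "school"]
accuracy = word_accuracy(reference, hypothesis)
```

Here the reference would be a carefully verified transcript of the audio and the hypothesis the delivered annotation; a score of 0.95 or higher would correspond to the 95% threshold in item 11.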

Submission Instructions:

Interested suppliers can apply by submitting their information and sample data through the form below.

https://forms.gle/2BCqqov5dhWcuGss8