How is the knowledge distillation implemented?

#22
by hxypqr - opened

I don't quite understand how knowledge distillation is implemented here.

Whisper is trained autoregressively on 680,000 hours of unlabeled data. According to Section 4 of the paper, the student model is trained on 21,170 hours of data with pseudo-labels generated by Whisper, with the parameters of the first and 32nd layers frozen from Whisper. Does this mean the student only needs a pass over 21,170 hours of pseudo-labeled data, a model structure similar to Whisper with the first and 32nd layers frozen, and a weighted combination of KL divergence and pseudo-label cross-entropy to achieve good results? (A sketch of my understanding of this objective follows below.)
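For reference, my understanding of the training objective described there is a weighted sum of a cross-entropy term against the Whisper pseudo-labels and a KL term against the teacher's token distribution. Below is a minimal sketch of that loss; the weights `alpha_ce`, `alpha_kl` and the `temperature` are illustrative placeholders, not the values used in the paper, and this is not the authors' actual implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      alpha_ce=1.0, alpha_kl=1.0, temperature=2.0,
                      pad_token_id=-100):
    """Weighted sum of pseudo-label cross-entropy and KL divergence
    between student and teacher next-token distributions.
    Hyperparameters here are placeholders for illustration only."""
    vocab_size = student_logits.size(-1)

    # Cross-entropy against the Whisper-generated pseudo-labels.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        pseudo_labels.view(-1),
        ignore_index=pad_token_id,
    )

    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as is standard in distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha_ce * ce + alpha_kl * kl
```

Is this roughly the shape of the objective, with the only large-scale ingredient being the pseudo-labeled data itself?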

If so, that is a significant finding: it suggests that, after pre-training, we can always apply similar methods to reduce a model's parameter count and inference time without a significant loss of accuracy.

Thank you in advance
