Abstract

This study investigates extractive summarization using sentence embeddings generated by fine-tuned BERT (Bidirectional Encoder Representations from Transformers) models together with K-Means clustering. To show how well BERT can capture knowledge in a specific domain such as engineering design, and what it produces after being fine-tuned on domain-specific datasets, several BERT models are fine-tuned, and the sentence embeddings extracted from them are used to generate summaries of a set of papers. The quality of the resulting summaries is then measured with different evaluation methods: both automatic evaluation, using Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and statistical evaluation are used in the comparison. The results indicate that a BERT model fine-tuned on a larger dataset generates summaries containing more domain terminology than the pretrained BERT model. Moreover, the summaries generated by BERT models overlap more with the original documents than those obtained from other popular non-BERT-based models. It can be concluded that the contextualized representations generated by BERT-based models capture the information in text and perform better in applications such as text summarization after being trained on domain-specific datasets.
