Journal of Educational Media and Library Sciences


Vol. 40, No. 3, Pages 325-344, 2003

Construction and Application of a Chinese OCR Test Collection for Information Retrieval (Article written in Chinese)

Mung-Chu TSAI & Yuen-Hsien TSENG

Abstract

This article describes the process of constructing a Chinese OCR test collection and the application of this collection in a retrieval experiment. We overcame the difficulty of obtaining past information needs for retrospective data and created 30 query topics that simulate real user needs. To obtain real OCR documents instead of simulated ones, we converted 8,439 full-text images into 8,439 OCR text files. An evaluation of the OCR documents reveals an average recognition accuracy of 70%. To obtain the relevant documents for each query, we invited three judges to examine each of the 8,439 images and assign a relevance score to each document for each topic. According to Kendall’s statistical coefficient, highly consistent judgments were obtained for 20 query topics. Finally, in our experiment with 12 search strategies, the results show that the retrieval effectiveness of OCR documents decreases to 70% when the recognition accuracy is about 70%.
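
The agreement check among the three judges can be illustrated with a short sketch. It assumes that the "Kendall's statistical coefficient" mentioned above is Kendall's coefficient of concordance (W) with the standard correction for tied scores; the abstract does not state which variant was used. The function names and the sample scores are hypothetical and are not taken from the article.

```python
def average_ranks(scores):
    """Rank scores in ascending order; tied scores share the mean of their positions.

    Returns the ranks and the tie term sum(t**3 - t) over groups of tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    tie_term = 0
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        group_size = end - start + 1
        tie_term += group_size ** 3 - group_size
        start = end + 1
    return ranks, tie_term


def kendalls_w(ratings):
    """Kendall's W for m judges (rows) scoring the same n documents (columns).

    Assumption: tie-corrected form, W = 12*S / (m^2 * (n^3 - n) - m * sum(T_j)).
    """
    m = len(ratings)
    n = len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    # Sum of the ranks each document received across all judges.
    rank_sums = [sum(ranks[i] for ranks, _ in ranked) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    ties = sum(tie_term for _, tie_term in ranked)
    denominator = m ** 2 * (n ** 3 - n) - m * ties
    if denominator == 0:
        raise ValueError("W is undefined when every judge gives all documents the same score")
    return 12 * s / denominator


# Hypothetical relevance scores from three judges for five documents on one topic.
topic_scores = [
    [2, 1, 0, 2, 0],
    [2, 1, 0, 1, 0],
    [2, 2, 0, 1, 0],
]
w = kendalls_w(topic_scores)
```

W ranges from 0 (no agreement) to 1 (identical rankings). Under this assumption, a topic would count as consistently judged when its W is high and statistically significant; the threshold used in the article is not given in the abstract.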

Keywords: OCR; information retrieval; test collection; effectiveness evaluation; Chinese document retrieval


