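To index PDF documents with Lucene, you need the third-party PDFBox library; at the time of writing, the latest PDFBox release is 0.8.0. The main code for building the index is as follows: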

Add the fields.

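// fileName is expected to be a java.io.File (it is used with getPath() below)
Document document = LucenePDFDocument.getDocument(fileName);
// Path: stored, and indexed as a single term without analysis
document.add(new Field(FIELD_PATH, fileName.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
// Content: stored, and run through the analyzer before indexing
document.add(new Field(FIELD_CONTENT, readPDF(fileName.getPath()), Field.Store.YES, Field.Index.TOKENIZED));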
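
The difference between the two Field.Index modes used above can be sketched in plain Java. This is only a toy stand-in, not Lucene's analyzer: the class name FieldModesSketch and both method names are made up for illustration, and the "tokenizer" simply lowercases and splits on non-alphanumeric characters, whereas a real analyzer such as StandardAnalyzer does more (stop-word removal, for instance).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy illustration only: NOT Lucene code, just a stand-in showing how
// the two indexing modes treat a field value differently.
class FieldModesSketch {

    // UN_TOKENIZED analogue: the whole value becomes one searchable term,
    // so a path can only be matched exactly.
    static List<String> unTokenized(String value) {
        return Collections.singletonList(value);
    }

    // TOKENIZED analogue: the value is broken into individual terms
    // (here: lowercased, split on anything that is not a letter or digit).
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<String>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(unTokenized("/docs/Lucene Guide.pdf"));
        System.out.println(tokenized("Indexing PDF files with Lucene"));
    }
}
```

That is why the path field uses UN_TOKENIZED (you look a file up by its exact path) while the content field uses TOKENIZED (you search for individual words inside the text).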

Read the PDF file.

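// Imports assume the PDFBox 0.8.0 package names (org.apache.pdfbox.*);
// the 0.7.x releases used org.pdfbox.* instead.
import java.io.FileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public String readPDF(String filePath) throws Exception {
    FileInputStream fis = new FileInputStream(filePath);
    PDDocument pdd = null;
    try {
        // Parse the PDF into an in-memory document model
        PDFParser parser = new PDFParser(fis);
        parser.parse();
        pdd = parser.getPDDocument();
        // Extract the plain text of the document
        return new PDFTextStripper().getText(pdd).trim();
    } finally {
        // Release resources even if parsing or extraction fails
        if (pdd != null) {
            pdd.close();
        }
        fis.close();
    }
}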

Add the document to the index.

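// true = create a new index in indexDir, replacing any existing one
IndexWriter indexWriter = new IndexWriter(indexDir, new StandardAnalyzer(), true);
indexWriter.addDocument(document);
indexWriter.close(); // flush the changes to disk and release the index lock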
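
To give a rough idea of what adding a document to an index means, here is a minimal inverted-index sketch in plain Java. It is not how Lucene is implemented; ToyInvertedIndex and its methods are hypothetical names, and the tokenization is deliberately naive. It only illustrates the principle: stored values (the path) are kept per document, and each term of the content points back to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: a conceptual stand-in, NOT Lucene's implementation.
class ToyInvertedIndex {
    // Stored field: document id -> path
    private final List<String> paths = new ArrayList<String>();
    // Indexed field: term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Rough analogue of addDocument: store the path, index the content terms.
    int addDocument(String path, String content) {
        int docId = paths.size();
        paths.add(path);
        for (String term : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Return the stored paths of all documents whose content contains the term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(paths.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("a.pdf", "Indexing PDF files with Lucene");
        index.addDocument("b.pdf", "Extracting text with PDFBox");
        System.out.println(index.search("lucene"));
        System.out.println(index.search("with"));
    }
}
```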

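Okay, if you have files in other formats, such as HTML, Word, Excel, or MP3, each of them also needs its own third-party library. The indexing process is much the same, so I won't repeat it here.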
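
One way to keep the indexing code the same across formats is to hide each format's library behind a common interface and pick the implementation by file extension. The sketch below is only one possible arrangement: TextExtractor and ExtractorRegistry are hypothetical names, not part of Lucene or PDFBox, and the extractor registered in main is a placeholder where a real one would call something like readPDF.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: one implementation per file format.
interface TextExtractor {
    String extract(String filePath) throws Exception;
}

// Hypothetical registry mapping a file extension to its extractor.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<String, TextExtractor>();

    // Extensions are stored in lowercase so lookups are case-insensitive.
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the extractor for the file's extension, or null if the file
    // has no extension or no extractor is registered for it.
    TextExtractor forFile(String filePath) {
        int dot = filePath.lastIndexOf('.');
        if (dot < 0 || dot == filePath.length() - 1) {
            return null;
        }
        return byExtension.get(filePath.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws Exception {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Placeholder extractor; a real one would delegate to the PDF reading code.
        registry.register("pdf", new TextExtractor() {
            public String extract(String filePath) {
                return "text of " + filePath;
            }
        });
        TextExtractor extractor = registry.forFile("report.PDF");
        System.out.println(extractor.extract("report.PDF"));
        System.out.println(registry.forFile("song.mp3") == null);
    }
}
```

With this in place, the field-building and IndexWriter steps stay unchanged; only the call that produces the content string varies by format.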
