TIKA - Extracting Text Documents
The following program extracts the content and metadata from a text document:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TextParser {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // Handler that collects the body text, and a container for the metadata
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // Text document parser
      TXTParser txtParser = new TXTParser();

      // try-with-resources closes the stream once parsing has finished
      try (FileInputStream inputstream = new FileInputStream(new File("example.txt"))) {
         txtParser.parse(inputstream, handler, metadata, pcontext);
      }

      System.out.println("Contents of the document:" + handler.toString());
      System.out.println("Metadata of the document:");
      for (String name : metadata.names()) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}
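Save the above code as TextParser.java and, with the Tika libraries on your classpath, compile and run it from the command prompt using the following commands:
javac TextParser.java
java TextParser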
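[Figure: snapshot of the sample.txt file]
[Figure: properties of the text document]
Note that the program opens example.txt, so save the sample text under that name in the working directory.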
If you execute the above program, it produces the following output.
Output:
Contents of the document:

At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages. The endeavour started by Mohtashim, an AMU alumni, who is the founder and the managing director of Tutorials Point (I) Pvt. Ltd. He came up with the website tutorialspoint.com in year 2006 with the help of handpicked freelancers, with an array of tutorials for computer programming languages.

Metadata of the document:

Content-Encoding: windows-1252
Content-Type: text/plain; charset = windows-1252
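The Content-Encoding entry in the metadata is the character set that the parser detected for the file. As a rough illustration of why that value matters, the sketch below decodes the same bytes with two different charsets using only the standard Java library, not Tika. The class name CharsetDemo, the helper decode, and the sample bytes are hypothetical and exist only for this example.

```java
import java.nio.charset.Charset;

public class CharsetDemo {

    // Decode raw bytes with a charset name, such as the value
    // reported in the Content-Encoding metadata field.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in windows-1252, but a lone 0xE9 is malformed UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding with the matching charset recovers the accented letter.
        System.out.println(decode(raw, "windows-1252"));

        // Decoding with the wrong charset substitutes U+FFFD for the bad byte.
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

When a text file is read outside Tika, passing the detected charset name to the reader in this way avoids the replacement characters that appear in the second case.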