OpenNLP - Sentence Detection

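While processing natural language, deciding where sentences begin and end is one of the problems to be addressed. This process is known as Sentence Boundary Disambiguation (SBD), or simply sentence breaking.

The techniques we use to detect the sentences in a given text depend on the language of the text.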

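Sentence Detection Using Java

We can detect the sentences in a given text in Java using regular expressions and a set of simple rules.

For example, let us assume that a period, a question mark, or an exclamation mark ends a sentence in the given text. We can then split the text into sentences using the split() method of the String class, which takes the regular expression as a String.

Following is a program that determines the sentences in a given text using a Java regular expression (the split() method). Save this program in a file named SentenceDetection_RE.java.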

public class SentenceDetection_RE {  
   public static void main(String args[]){ 
     
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
     
      //Regular expression matching a single sentence terminator 
      String simple = "[.?!]";      
      String[] splitString = sentence.split(simple);     
      
      //trim() removes the space left at the start of each piece 
      for (String string : splitString)   
         System.out.println(string.trim());      
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetection_RE.java 
java SentenceDetection_RE

On executing, the above program prints the following output. Note that split() discards the terminating punctuation.

Hi 
How are you 
Welcome to Tutorialspoint 
We provide free tutorials on various technologies
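
The split() call above drops the terminators, and any rule of this kind is easily fooled by abbreviations. As a minimal sketch (the class and method names here are illustrative, not part of the original program), a lookbehind lets the split keep the punctuation, and the second call shows where a purely rule-based approach breaks down:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitKeepPunctuation {

   // Splits after '.', '?' or '!' when followed by whitespace, keeping the terminator.
   static List<String> splitSentences(String text) {
      List<String> result = new ArrayList<>();
      // (?<=[.?!]) is a lookbehind: it matches the position right after a
      // terminator without consuming it; \\s+ consumes the whitespace that follows.
      for (String part : text.trim().split("(?<=[.?!])\\s+")) {
         if (!part.isEmpty()) {
            result.add(part);
         }
      }
      return result;
   }

   public static void main(String[] args) {
      String text = "Hi. How are you? Welcome to Tutorialspoint. "
         + "We provide free tutorials on various technologies";
      for (String s : splitSentences(text)) {
         System.out.println(s);
      }
      // The rule cannot tell an abbreviation from a sentence end:
      // this text is cut after "Dr." as well.
      System.out.println(splitSentences("Dr. Smith arrived. He was late."));
   }
}
```

Cases like the abbreviation are the reason a trained model, such as the one OpenNLP provides, is generally preferred over hand-written rules.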

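Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This model is trained to detect sentences in a given raw text.

The opennlp.tools.sentdetect package contains the classes and interfaces that are used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to:

Load the en-sent.bin model using the SentenceModel class.
Instantiate the SentenceDetectorME class.
Detect the sentences using the sentDetect() method of this class.

Following are the steps to write a program that detects the sentences in a given raw text.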

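Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the opennlp.tools.sentdetect package.

To load a sentence detection model:

Create an InputStream object for the model (instantiate FileInputStream and pass the path of the model, as a String, to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block: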

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

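Step 2: Instantiating the SentenceDetectorME Class

The SentenceDetectorME class of the opennlp.tools.sentdetect package contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and decide whether they mark the end of a sentence.

Instantiate this class and pass the model object created in the previous step, as shown below.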

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);
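
To give some intuition for the phrase "maximum entropy model" used in this step, here is a toy sketch of how such a model can turn features of a candidate '.' into a probability of "this is a sentence end". The two features and all weights below are invented for illustration; they are not OpenNLP's actual features or trained parameters, and the class is not part of the OpenNLP API.

```java
class MaxEntToy {

   // Probability of the "split" outcome in a two-outcome maximum entropy model:
   // p(split | x) = exp(wSplit . x) / (exp(wSplit . x) + exp(wNoSplit . x))
   static double splitProbability(double[] features, double[] wSplit, double[] wNoSplit) {
      double a = dot(features, wSplit);
      double b = dot(features, wNoSplit);
      double max = Math.max(a, b);      // subtracting the max avoids overflow in exp()
      double ea = Math.exp(a - max);
      double eb = Math.exp(b - max);
      return ea / (ea + eb);
   }

   static double dot(double[] x, double[] w) {
      double sum = 0.0;
      for (int i = 0; i < x.length; i++) {
         sum += x[i] * w[i];
      }
      return sum;
   }

   public static void main(String[] args) {
      // Hypothetical binary features for a '.' candidate:
      // [next token is capitalized, previous token is a known abbreviation]
      double[] wSplit = {2.0, -3.0};    // made-up weights
      double[] wNoSplit = {0.0, 0.0};

      // '.' followed by a capitalized word, no abbreviation before it
      System.out.println(splitProbability(new double[] {1, 0}, wSplit, wNoSplit));
      // same, but the previous token looks like an abbreviation
      System.out.println(splitProbability(new double[] {1, 1}, wSplit, wNoSplit));
   }
}
```

With these made-up weights the first candidate scores above 0.5 and the second below it, which is the kind of decision the detector makes for every candidate terminator.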

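Step 3: Detecting the Sentences

The sentDetect() method of the SentenceDetectorME class detects the sentences in the raw text passed to it. It accepts a String as its parameter and returns the detected sentences as a String array.

Invoke this method by passing the text as a String.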

//Detecting the sentence 
String sentences[] = detector.sentDetect(sentence);

Example

Following is a program that detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionME { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
    
      //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionME.java 
java SentenceDetectionME

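On executing, the above program reads the given String, detects the sentences in it, and displays the following output. Note that for this input the model treats "Hi. How are you?" as a single sentence.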

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies

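Detecting the Positions of the Sentences

We can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

Following are the steps to write a program that detects the positions of the sentences in a given raw text.

Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the opennlp.tools.sentdetect package.

To load a sentence detection model:

Create an InputStream object for the model (instantiate FileInputStream and pass the path of the model, as a String, to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block.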

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

Step 2: Instantiating the SentenceDetectorME Class

Instantiate the SentenceDetectorME class and pass the model object created in the previous step.

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

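Step 3: Detecting the Positions of the Sentences

The sentPosDetect() method of the SentenceDetectorME class detects the positions of the sentences in the raw text passed to it. It accepts a String as its parameter.

Invoke this method by passing the text as a String.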

//Detecting the position of the sentences in the paragraph  
Span[] spans = detector.sentPosDetect(sentence);

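Step 4: Printing the Spans of the Sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of type Span. The Span class of the opennlp.tools.util package stores the start and end offsets of a range of text; here, each Span marks where one sentence begins and ends.

You can store the spans returned by the sentPosDetect() method in a Span array and print them, as shown in the following code block.

//Printing the spans of the sentences 
for (Span span : spans)         
   System.out.println(span);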

Example

Following is a program that detects the positions of the sentences in a given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
  
import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span;

public class SentencePosDetection { 
  
   public static void main(String args[]) throws Exception { 
   
      String paragraph = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the raw text 
      Span spans[] = detector.sentPosDetect(paragraph); 
       
      //Printing the spans of the sentences in the paragraph 
      for (Span span : spans)         
         System.out.println(span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencePosDetection.java 
java SentencePosDetection

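On executing, the above program reads the given String, detects the sentences in it, and displays their spans as shown below. Each span is printed as [start..end), where the end offset is exclusive.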

[0..16) 
[17..43) 
[44..93)
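
To make the [start..end) notation concrete, the following sketch applies the three offset pairs printed above to the same paragraph using only String.substring(). Plain int pairs stand in for Span objects here, and the class and method names are illustrative assumptions, so the sketch runs without the OpenNLP library.

```java
class SpanOffsetsDemo {

   // Returns the text covered by the half-open range [start, end):
   // the character at 'start' is included, the one at 'end' is not.
   static String covered(String text, int start, int end) {
      return text.substring(start, end);
   }

   public static void main(String[] args) {
      String paragraph = "Hi. How are you? Welcome to Tutorialspoint. "
         + "We provide free tutorials on various technologies";

      // Offsets copied from the output above; int pairs stand in for Span objects.
      int[][] offsets = { {0, 16}, {17, 43}, {44, 93} };

      for (int[] o : offsets) {
         System.out.println("[" + o[0] + ".." + o[1] + ") -> "
            + covered(paragraph, o[0], o[1]));
      }
   }
}
```

The characters at offsets 16 and 43 are the spaces between sentences, which is why they fall outside every span.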

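Sentences along with Their Positions

The substring() method of the String class accepts begin and end offsets and returns the corresponding string. We can use this method to print the sentences together with their spans (positions), as shown in the following code block.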

for (Span span : spans)         
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);

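Following is a program that detects the sentences in a given raw text and displays them along with their positions. Save this program in a file named SentencesAndPosDetection.java.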

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span; 
   
public class SentencesAndPosDetection { 
  
   public static void main(String args[]) throws Exception { 
     
      String sen = "Hi. How are you? Welcome to Tutorialspoint." 
         + " We provide free tutorials on various technologies"; 
      //Loading a sentence model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the paragraph  
      Span[] spans = detector.sentPosDetect(sen);  
      
      //Printing the sentences and their spans of a paragraph 
      for (Span span : spans)         
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencesAndPosDetection.java 
java SentencesAndPosDetection

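On executing, the above program reads the given String, detects the sentences along with their positions, and displays the following output.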

Hi. How are you? [0..16) 
Welcome to Tutorialspoint. [17..43)  
We provide free tutorials on various technologies [44..93)

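Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.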

//Getting the probabilities of the last decoded sequence       
double[] probs = detector.getSentenceProbabilities();

Following is a program that prints the probabilities associated with a call to the sentDetect() method. Save this program in a file named SentenceDetectionMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionMEProbs { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);  
      
      //Detecting the sentence 
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);   
         
      //Getting the probabilities of the last decoded sequence       
      double[] probs = detector.getSentenceProbabilities(); 
       
      System.out.println("  "); 
       
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionMEProbs.java 
java SentenceDetectionMEProbs

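On executing, the above program reads the given String, detects the sentences, and prints them. It then prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.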

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies 
   
0.9240246995179983 
0.9957680129995953 
1.0
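
The output above lists the sentences and the probabilities separately. Since the sample shows one probability per detected sentence, in the same order, the two arrays can be printed side by side. The helper below is a hypothetical sketch that uses only the standard library: it takes the two arrays as plain parameters (in a real program they would come from sentDetect() and getSentenceProbabilities()), and its main() method reuses the values from the sample output.

```java
import java.util.Locale;

class SentenceProbabilityReport {

   // Pairs each sentence with the probability stored at the same index.
   static String[] report(String[] sentences, double[] probs) {
      if (sentences.length != probs.length) {
         throw new IllegalArgumentException("Expected one probability per sentence");
      }
      String[] lines = new String[sentences.length];
      for (int i = 0; i < sentences.length; i++) {
         // Locale.ROOT keeps '.' as the decimal separator on every machine
         lines[i] = String.format(Locale.ROOT, "%.4f  %s", probs[i], sentences[i]);
      }
      return lines;
   }

   public static void main(String[] args) {
      // Values copied from the sample output above
      String[] sentences = {
         "Hi. How are you?",
         "Welcome to Tutorialspoint.",
         "We provide free tutorials on various technologies"
      };
      double[] probs = {0.9240246995179983, 0.9957680129995953, 1.0};

      for (String line : report(sentences, probs)) {
         System.out.println(line);
      }
   }
}
```

Printing the pairs together makes it easier to spot boundaries the model was less sure about, such as the first one in this example.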
