ReferenceExtractor

java.lang.Object
- org.grobid.core.engines.patent.ReferenceExtractor

All Implemented Interfaces:

java.io.Closeable, java.lang.AutoCloseable
```
public class ReferenceExtractor
extends java.lang.Object
implements java.io.Closeable
```
Extracts patent and non-patent literature (NPL) references from the body of a patent document using Conditional Random Fields.

Field Summary

Fields
Modifier and Type	Field and Description
`java.lang.String`	`currentPatentNumber`
`boolean`	`debug`
`Lexicon`	`lexicon`
`OPSService`	`ops`
`java.util.ArrayList<BibDataSet>`	`resBib`

Constructor Summary

Constructors
Constructor and Description
`ReferenceExtractor()`
`ReferenceExtractor(EngineParsers parsers)`

Method Summary

Modifier and Type	Method and Description
`java.lang.String`	`annotateAllReferences(Document doc, java.util.List<LayoutToken> tokenizations, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Annotate all references from a list of layout tokens.
`java.lang.String`	`annotateAllReferencesPDFFile(java.lang.String inputFile, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Produce JSON annotations for all references in the PDF file of a patent publication.
`void`	`close()`
`java.lang.String`	`extractAllReferencesOPS(boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Extract all references from the full text retrieved via OPS.
`java.lang.String`	`extractAllReferencesPDFFile(java.lang.String inputFile, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Extract all references from the PDF file of a patent publication.
`java.lang.String`	`extractAllReferencesString(java.lang.String text, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Extract all references from a simple piece of text.
`java.lang.String`	`extractAllReferencesXMLFile(java.lang.String pathXML, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents, java.util.List<BibDataSet> articles)` Extract all references from an XML file in ST.36 or MAREC format.
`java.lang.String`	`extractPatentReferencesXMLFile(java.lang.String pathXML, boolean filterDuplicate, int consolidate, boolean includeRawCitations, java.util.List<PatentItem> patents)` Extract all references from a patent in an ST.36-like XML format.
`void`	`generateTrainingData(java.lang.String documentPath, java.lang.String newTrainingPath)` Annotate a new XML patent document with the current model, producing output in the training data format.
`void`	`generateXMLReport(java.io.File file, java.util.ArrayList<PatentItem> patents, java.util.ArrayList<BibDataSet> articles)` Write the list of extracted references to an XML file.
`boolean`	`getDocOPS(java.lang.String number)` Get a patent description by its number via OPS.
`java.lang.String`	`reference2BibTeX(int i)` Get the BibTeX string corresponding to the recognized citation section for a given citation
`java.lang.String`	`reference2TEI(int i)` Get the TEI XML string corresponding to the recognized citation section for a particular citation
`java.lang.String`	`references2BibTeX()` Get the BibTeX string corresponding to the recognized citation section
`java.lang.String`	`references2TEI()` Get the TEI XML string corresponding to the recognized citation section, with pointers and advanced structuring
`java.lang.String`	`references2TEI2()` Get the TEI XML string corresponding to the recognized citation section
`void`	`setDocumentPath(java.lang.String dirName)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

debug
```
public boolean debug
```

lexicon
```
public Lexicon lexicon
```

currentPatentNumber

public java.lang.String currentPatentNumber

ops
```
public OPSService ops
```

resBib

public java.util.ArrayList<BibDataSet> resBib

Constructor Detail

ReferenceExtractor
```
public ReferenceExtractor()
```

ReferenceExtractor

public ReferenceExtractor(EngineParsers parsers)

Method Detail

setDocumentPath

public void setDocumentPath(java.lang.String dirName)

extractAllReferencesOPS

public java.lang.String extractAllReferencesOPS(boolean filterDuplicate,
                                                int consolidate,
                                                boolean includeRawCitations,
                                                java.util.List<PatentItem> patents,
                                                java.util.List<BibDataSet> articles)

Extract all references from the full text retrieved via OPS.

extractPatentReferencesXMLFile

public java.lang.String extractPatentReferencesXMLFile(java.lang.String pathXML,
                                                       boolean filterDuplicate,
                                                       int consolidate,
                                                       boolean includeRawCitations,
                                                       java.util.List<PatentItem> patents)

Extract all references from a patent in an ST.36-like XML format.

extractAllReferencesXMLFile

public java.lang.String extractAllReferencesXMLFile(java.lang.String pathXML,
                                                    boolean filterDuplicate,
                                                    int consolidate,
                                                    boolean includeRawCitations,
                                                    java.util.List<PatentItem> patents,
                                                    java.util.List<BibDataSet> articles)

Extract all references from an XML file in ST.36 or MAREC format.

extractAllReferencesPDFFile

public java.lang.String extractAllReferencesPDFFile(java.lang.String inputFile,
                                                    boolean filterDuplicate,
                                                    int consolidate,
                                                    boolean includeRawCitations,
                                                    java.util.List<PatentItem> patents,
                                                    java.util.List<BibDataSet> articles)

Extract all references from the PDF file of a patent publication.

annotateAllReferencesPDFFile

public java.lang.String annotateAllReferencesPDFFile(java.lang.String inputFile,
                                                     boolean filterDuplicate,
                                                     int consolidate,
                                                     boolean includeRawCitations,
                                                     java.util.List<PatentItem> patents,
                                                     java.util.List<BibDataSet> articles)

Produce JSON annotations for all references in the PDF file of a patent publication.

extractAllReferencesString

public java.lang.String extractAllReferencesString(java.lang.String text,
                                                   boolean filterDuplicate,
                                                   int consolidate,
                                                   boolean includeRawCitations,
                                                   java.util.List<PatentItem> patents,
                                                   java.util.List<BibDataSet> articles)

Extract all references from a simple piece of text.
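
The extraction methods take caller-supplied `patents` and `articles` lists. This page does not say how those lists are used; the sketch below assumes the common pattern where the caller passes empty lists and the method fills them. It is a toy stand-in, not the GROBID implementation: `OutputListSketch`, its `extract` method, the `String` element type, the prefix-based classification rule and the returned summary string are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OutputListSketch {

    // Hypothetical stand-in that mirrors only the parameter shape of the
    // documented methods. Assumption: the lists are output parameters.
    static String extract(String text, boolean filterDuplicate,
                          List<String> patents, List<String> articles) {
        for (String raw : text.split(";")) {
            String item = raw.trim();
            if (item.isEmpty()) {
                continue;
            }
            // Toy rule, for illustration only: a "US" or "EP" prefix marks a patent.
            boolean isPatent = item.startsWith("US") || item.startsWith("EP");
            List<String> target = isPatent ? patents : articles;
            // With filterDuplicate set, an item already collected is skipped.
            if (filterDuplicate && target.contains(item)) {
                continue;
            }
            target.add(item);
        }
        // Placeholder return value; the real return format is not documented here.
        return "patents=" + patents.size() + ", articles=" + articles.size();
    }

    public static void main(String[] args) {
        List<String> patents = new ArrayList<>();
        List<String> articles = new ArrayList<>();
        String summary = extract("US 1234567; Smith 2001; US 1234567",
                                 true, patents, articles);
        System.out.println(summary);
        System.out.println(patents);
        System.out.println(articles);
    }
}
```

Under this assumption, the caller reads the results from the two lists after the call returns rather than from the returned string alone.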

annotateAllReferences

public java.lang.String annotateAllReferences(Document doc,
                                              java.util.List<LayoutToken> tokenizations,
                                              boolean filterDuplicate,
                                              int consolidate,
                                              boolean includeRawCitations,
                                              java.util.List<PatentItem> patents,
                                              java.util.List<BibDataSet> articles)

Annotate all references from a list of layout tokens.

references2TEI2
```
public java.lang.String references2TEI2()
```
Get the TEI XML string corresponding to the recognized citation section

reference2TEI
```
public java.lang.String reference2TEI(int i)
```
Get the TEI XML string corresponding to the recognized citation section for a particular citation

references2BibTeX
```
public java.lang.String references2BibTeX()
```
Get the BibTeX string corresponding to the recognized citation section

references2TEI
```
public java.lang.String references2TEI()
```
Get the TEI XML string corresponding to the recognized citation section, with pointers and advanced structuring

reference2BibTeX
```
public java.lang.String reference2BibTeX(int i)
```
Get the BibTeX string corresponding to the recognized citation section for a given citation

generateTrainingData
```
public void generateTrainingData(java.lang.String documentPath,
                                 java.lang.String newTrainingPath)
```
Annotate a new XML patent document with the current model, producing output in the training data format.

Parameters:

documentPath - path to the file to be processed

newTrainingPath - new training path

getDocOPS
```
public boolean getDocOPS(java.lang.String number)
```
Get a patent description by its number via OPS.

generateXMLReport

public void generateXMLReport(java.io.File file,
                              java.util.ArrayList<PatentItem> patents,
                              java.util.ArrayList<BibDataSet> articles)

Write the list of extracted references to an XML file.

close
```
public void close()
           throws java.io.IOException
```
Specified by:

close in interface java.io.Closeable

Specified by:

close in interface java.lang.AutoCloseable

Throws:

java.io.IOException
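
Because the class implements `java.io.Closeable`, an instance can be managed with try-with-resources so that `close()` runs even when processing fails. The sketch below illustrates only that contract. `StubExtractor` and its `process` method are hypothetical stand-ins so the example runs without GROBID; they are not part of the documented API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CloseableSketch {

    // Hypothetical stand-in for ReferenceExtractor: it mirrors only the
    // documented fact that the class implements java.io.Closeable.
    static class StubExtractor implements Closeable {
        private boolean closed = false;

        // Hypothetical method, not part of the GROBID API.
        String process(String text) {
            if (closed) {
                throw new IllegalStateException("extractor already closed");
            }
            return text.trim();
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() throws IOException {
            closed = true;
        }
    }

    // Runs one call inside try-with-resources and returns the stub so the
    // caller can check that close() was invoked when the block exited.
    static StubExtractor runOnce(String text) {
        StubExtractor stub = new StubExtractor();
        try (StubExtractor extractor = stub) {
            extractor.process(text);
        } catch (IOException e) {
            // close() declares IOException, so it must be handled or rethrown.
            throw new UncheckedIOException(e);
        }
        return stub;
    }

    public static void main(String[] args) {
        StubExtractor used = runOnce("  US 1234567  ");
        System.out.println("closed after block: " + used.isClosed());
    }
}
```

With the real class, the same shape would apply: construct the extractor in the resource clause, call the extraction methods inside the block, and handle the `IOException` that `close()` declares.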
