The Dell Business Intelligence Project Using USPTO Data: Episode 2
Author: Umesh Sunnapu
Links: Enterprise Solutions, Business Intelligence and Appliance Solutions
Episode 2 | Sample XML
Welcome to the second episode of the Dell Business Intelligence Project Using USPTO Data. This episode answers the following questions:
- Where to download USPTO Patent Grant Full Text
- What is the Document Type Definition (DTD)
- Where to find the Document Type Definitions (DTDs)
- How to import the sample files into Microsoft Excel
- How to avoid common mistakes
The USPTO project uses Dell Boomi, Dell Quickstart Data Warehouse Appliance, and Dell Toad products to analyze publicly available data. For more information about the goal and scope of the project, as well as a breakdown of the episodes, see Episode 1.
Where to Download the USPTO Patent Grant Full Text
Download the USPTO Patent Grant Full Text data from http://www.google.com/googlebooks/uspto-patents-grants-text.html by selecting the desired files. Each week the patents are compiled into a single file and uploaded to the Google website, which produces roughly 52 files for a given year. The Google website hosts patent grants from 1976 to 2012. Note, however, that the USPTO data uses a different format for 1976 to 2001 than for 2001 to present.
From the Google USPTO site:
Patent Grant Full Text (2001 to present):
Contains the full text including tables, sequence data and "in-line" mathematical expressions of each patent grant issued weekly (Tuesdays) from January. The file is a concatenation of the Standard Generalized Markup Language (SGML) in accordance with the U.S. Patent Grant Version 2.4 Document Type Definition (DTD) and eXtensible Markup Language (XML) in accordance with the U.S. Patent Grant Version 2.5; 4.0 International Common Element (ICE); 4.1 ICE; 4.2 ICE Document Type Definitions (DTDs). Sequence data XML text in accordance with the ICE SEQLST V1.2 DTD (us-sequence-listing-2004-03-09.dtd) is concatenated next to the containing grant SGML or XML text. References to the following external files are present but the external files are not present:
- Mega Sequence Listing data files
- Mathematica Notebook (NB) files
- CS ChemDraw (CDX) and MDL Information Systems (MOL) files
- Drawings, mathematical expressions, and chemical structures image (TIFF) files
Patent Grant Full Text (1976 to 2001):
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1976 to December 2001. The file format is ASCII text (a.k.a. Patent Grant Green Book). Included are tables and "in-line" mathematical equations, where appropriate, appearing as text data. Chemical structures are not present, but their location is indicated by a structure call-out. Includes patent number, series code and application number, type of patent, filing date, title, issue date, inventor information, assignee name at time of issue, foreign priority information, related US patent documents, classification information, US and foreign references, attorney, agent or firm/legal representative, Patent Cooperation Treaty (PCT) information, abstract, specification, and claims. Approximately 4,000 patent grants per week. Refer to the following link for additional Patent Grant Data/APS documentation:
PatentFullTextAPSDoc_GreenBook.pdf
What is the Document Type Definition (DTD)
A document type definition (DTD) is a set of markup declarations that define a document type. In the Google USPTO data, the document type is XML. The DTD file declares the elements and references that can appear in the Google USPTO XML file.
Where to Find the Document Type Definitions (DTDs)
The DTD files for the data from 2001 to present are found on the USPTO sample data site (Figure 1).
Figure 1. USPTO website for XML resources
How to Import the Sample Files into Microsoft Excel
- Download the sample documents with DTDs (full text with embedded images), a file named i110104-sample.zip, by clicking here.
- Using your favorite zip extractor, such as WinZip, unzip i110104-sample.zip.
- Locate and extract all subdirectories within the following path:
- <zip_download_loc>\I20110104 Sample\DTDS\DTDS\PTO-ICE-GRANT-2007\DTDS
- The i110104-sample.zip contains multiple patents that you can view. Select a patent from one of the following directories to extract:
- sample\I110104-sample\I20110104 Sample\DESIGN
- sample\I110104-sample\I20110104 Sample\PLANT
- sample\I110104-sample\I20110104 Sample\REISSUE
- sample\I110104-sample\I20110104 Sample\SIR
- sample\I110104-sample\I20110104 Sample\UTIL07861
- sample\I110104-sample\I20110104 Sample\UTIL07863
- In this example, we extracted:
- \I20110104 Sample\UTIL07861\US07861317-20110104\US07861317-20110104
- Copy the XML file found within the US07861317-20110104 directory named US07861317-20110104.XML and place it within the DTD folder extracted earlier (<zip_download_loc>\I20110104 Sample\DTDS\DTDS\PTO-ICE-GRANT-2007\DTDS)
- Open Microsoft Excel
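The extract-and-copy steps above can also be scripted. The following is a minimal sketch, not part of the original walkthrough: the function name stage_sample and its parameters are hypothetical, and it assumes you pass in the archive-internal paths that match your own download, written with forward slashes.

```python
import shutil
import zipfile
from pathlib import Path


def stage_sample(zip_path, xml_member, dtd_member_dir, dest):
    """Extract a sample archive and copy one patent XML next to the DTD files.

    xml_member and dtd_member_dir are paths inside the archive (forward
    slashes); dest is the folder to extract into. All names are illustrative.
    """
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        # Extract everything so the DTD folder keeps all of its subdirectories.
        archive.extractall(dest)
    dtd_dir = dest / dtd_member_dir
    xml_file = dest / xml_member
    # Copy rather than move, so the original patent folder stays intact.
    return Path(shutil.copy(xml_file, dtd_dir))
```

The returned path is the copy that sits beside the DTD files, which is the file to open in Microsoft Excel.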
Common mistakes
If the system cannot locate an imported XML file, the XML file is usually in the wrong place. This error occurs when you open an XML file from a folder that does not contain the DTD files and all of their subdirectories. In our example, we placed the XML file, US07861317-20110104.XML, in the following location:
<zip_download_loc>\I20110104 Sample\DTDS\DTDS\PTO-ICE-GRANT-2007\DTDS
Figure 2. When XML files are misplaced, the system cannot locate the object specified.
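One way to catch a misplaced file before opening it is to check whether the DTD named in the XML file's DOCTYPE line exists in the same folder. The sketch below is an illustration only: the helper name missing_dtd is hypothetical, and it assumes the DOCTYPE uses a relative SYSTEM identifier (a plain file name) that appears within the first few kilobytes of the file.

```python
import re
from pathlib import Path

# Captures the SYSTEM identifier, e.g. <!DOCTYPE root SYSTEM "file.dtd">
DOCTYPE_SYSTEM = re.compile(r'<!DOCTYPE\s+[^\s>]+\s+SYSTEM\s+["\']([^"\']+)["\']')


def missing_dtd(xml_path, head_bytes=4096):
    """Return the DTD path the XML file expects if it is absent, else None."""
    xml_path = Path(xml_path)
    with open(xml_path, "rb") as handle:
        # Only the start of the file is needed; the DOCTYPE precedes the root element.
        head = handle.read(head_bytes).decode("utf-8", errors="replace")
    match = DOCTYPE_SYSTEM.search(head)
    if match is None:
        return None  # no external DTD referenced
    # A relative identifier is resolved against the XML file's own folder.
    dtd_path = xml_path.parent / match.group(1)
    return None if dtd_path.exists() else dtd_path
```

A non-None result means the XML file sits in a folder without its DTD. The check covers only the top-level DTD file, not the subdirectories it may reference.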
Another common error with the data on the Google Bulk Downloads page arises because Microsoft Excel can read only one XML document at a time. For example, download and extract the file named ipg120103.zip. The XML file within this zip file contains multiple XML documents concatenated together. We know this because an XML declaration is placed before each patent. The XML declaration reads as follows:
<?xml version="1.0" encoding="UTF-8"?>
If you try to open the XML document directly from the zip file, you see the following error: “File cannot be opened because: Invalid xml declaration.”
Figure 3. File error when opening the XML document directly from inside a zipped directory.
To open a file properly in Microsoft Excel, split the download so that each XML file includes only one <?xml version="1.0" encoding="UTF-8"?> declaration.
For example:
- Create a test.xml file.
- Copy the contents of one patent from the downloaded USPTO XML file, from its XML declaration through its closing tag:
<?xml version="1.0" encoding="UTF-8"?>
.
.
.
Patent Data
.
.
.
</us-patent-grant>
- Save the new test.xml file.
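For a weekly file that holds thousands of patents, the manual copy above can be automated. The following is a minimal sketch under one assumption: every XML declaration begins at the start of a line. The function names split_xml_documents and write_documents and the patent_00001.xml naming scheme are hypothetical choices for illustration.

```python
from pathlib import Path


def split_xml_documents(lines):
    """Yield each XML document from the lines of a concatenated file.

    A new document starts at every line that begins with an XML declaration.
    """
    current = []
    for line in lines:
        if line.lstrip().startswith("<?xml ") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def write_documents(source_path, out_dir):
    """Write each document of a concatenated file to its own numbered file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source_path, encoding="utf-8") as source:
        # Iterating over the file object reads one line at a time,
        # so the whole weekly file never has to fit in memory.
        for index, document in enumerate(split_xml_documents(source), start=1):
            target = out_dir / f"patent_{index:05d}.xml"
            target.write_text(document, encoding="utf-8")
            written.append(target)
    return written
```

Each output file then contains a single declaration. Pointing out_dir at the folder that holds the DTD files also avoids the misplaced-file error described earlier.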