Quantcast
Channel: TechCenter
Viewing all articles
Browse latest Browse all 2350

The Dell Business Intelligence Project Using USPTO Data: Episode 10

$
0
0
Revision 1 posted to Business Intelligence and Appliance Solutions by Robert_Pound on 5/31/2013 2:08:40 PM

Author: Robert Pound

Links:Enterprise SolutionsBusiness Intelligence and Appliance Solutions

Episode 10 | Dell Boomi One Year

This episode uses the process created in Episode 8 to extract data from the US patent office website and store it in a SQL database. We will modify the process in Episode 8 to collect one year’s worth of US Patent Office data, and then execute the SQL stored procedure discussed in Episode 9 to remove duplicate data.

The USPTO project uses Dell Boomi, Dell Quickstart, Data Warehouse Appliance, and Toad products to analyze publicly-available data. For more information about the goal and scope of the project, as well as a breakdown of the episodes, go to Episode 1.

Dell Boomi One Year

This final Dell Boomi episode combines all processes into a single process that will extract and translate one year’s worth of US Patent Office data into meaningful SQL data that is ready to be analyzed. This episode contains the following sections:

  •  Extract multiple files
    • Current solution
      • Create manual file list
      • Create disk connector
      • Create split shape
    • Future solution
      • Groovy project
  • Modify previous project
    • Add multiple file shapes
    • Add SQL procedure

Extract Multiple Files

There are two solutions for collecting a year’s worth of data in Boomi.

The first solution involves manually creating a list of files to download so that the previously-created process can execute on each file. Because of the simplicity and short time to implement, we chose this option for the project.

The second solution involves using Groovy programming to create a custom script in the data process shape to automatically parse the US Patent Office main page for all current files for a given year and then feed that into the HTTP connector. This would involve a long development cycle; therefore, this will be a future topic.

Current Solution

The current solution involves creating a flat file that contains all files for a given year and saving it to the Quickstart Data Warehouse Appliance server. We are creating a disk connect similar to those created in previous episodes.

Create Manual File List

1.    Highlight the desired file names at the website http://www.google.com/googlebooks/uspto-patents-grants-text.html. The example in Figure 1 shows all of 2013 files.

                                                                                            Figure 1.      Example files

2.    Copy the highlighted file names.

3.    Open Notepad.

4.    Paste the contents into Notepad.

5.    Format the Notepad contents to one file per line as shown in Figure 2.

                                                                              Figure 2.      Sample Notepad contents

6.    Select File -> Save As. For example, save as c:\uspto\2013_list.txt.

Create Disk Connector

Create a new Connector Operation:

1.    In the Create Component window, select Connector Operation.

2.    Name the component 2013_list.

3.    Select the folder in which the previous project was created.

4.    Select type disk.

5.    Click Create.

Modify the Connector Operation:

1.    Select Get.

2.    File Filter: 2013_list.txt

3.    File Matching Type: *

4.    Click Save, and then click Close.

Create Split Shape

To create the data process shape, the process from the previous episode must first be opened.

1.    Drag and drop a data process shape to the process.

2.    Label: Split year.

3.    Add Step.

a.    Process Type: Split Document

b.    Profile Type: Flat file

c.     Split Option: Split by line

4.    Click OK.

Future Solution

A more complete solution to pulling one year or multiple years of data is by programmatically parsing the bulk data page for the US Patent Office and collecting all file names for a given year. This would involve using Groovy, which is the language used in the custom scripting option of data process shapes. You would use Groovy to pull in data from the HTML page and then output the list of files in a line-by-line format. This would eliminate the dependence on the local disk connector as well as make the process more flexible and reduce the amount of manual intervention.

Modifying Previous Project

Since there was only one addition, Split Year, from the steps above, the process should look like Figure 3 below.

                                                                        Figure 3.      Modifying the previous project

Add Multiple File Shapes

To finalize the process, we will first add the multiple file content created at the start of this episode:

1.    Configure start shape.

a.    Select No Data.

b.    Click OK.

2.    Drag and drop the connector shape.

a.    Label: get 2013 list

b.    Connector: Disk

c.     Action: Get

d.    Connection: local USPTO

e.    Operation: 2013_list

f.     Click OK.

3.    Drag and drop connector shape

a.    Label: http multi

b.    Connector: HTTP Client

c.     Action: Get

d.    Connection: New

i.    Label: 2013 bulk

ii.    URL: http://storage.googleapis.com/patents/grant_full_text/2013/

iii.    Click Save, and then click Close.

e.    Operation: New

i.    Label: 2013 in

ii.    Under Resource Path, create new variable:

1.    USPTO_IN

2.    Is replacement variable?: checked

iii.    Click Save and then Close.

f.     Parameters tab

i.    Select new (+)

1.    Input: USPTO_file

2.    Type: Profile Element

3.    Profile Type: Flat File

4.    Profile: Create New (+)

a.    Label: List Profile

b.    New Element

i.    Data Name: Name

c.     Click Save, and then click Close.

5.    Element: Name

ii.    Click OK.

g.    Click OK.

4.    Connect new shapes to previous content in the following order:

a.    start shape -> get 2013 list -> Split year -> http multi

                                                                            Figure 4.      Connecting the new shapes

5.    Click Save.

Add SQL Procedure

Using the stored procedure, the SQL connector, and Program Command shape discussed in the previous episode, we will add the final section to the process to create a complete Dell Boomi process for collecting and storing data into SQL.

1.    Create shape:

a.    Drag and drop Program Command

i.    Label: clean and insert

ii.    Type: Stored Procedure

iii.    Connection: SQL connection

iv.    Procedure Name: dbo.del_dup

b.    Click OK.

2.    Connect shape just before stop.

3.    The final process should look like Figure 5.

                                                                                            Figure 5.      Final process


Viewing all articles
Browse latest Browse all 2350

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>