The Dell Business Intelligence Project Using USPTO Data: Episode 5
Author: Robert Pound, Roger Lopez, Umesh Sunnapu
Links: Enterprise Solutions, Business Intelligence and Appliance Solutions
Episode 5 | Dell Boomi HTTP to CSV
This episode combines the previous processes to extract and map an XML file from the USPTO Web site to a CSV file on a local disk.
The USPTO project uses Dell Boomi, Dell Quickstart Data Warehouse Appliance and Toad products to analyze publicly available data. For more information about the goal and scope of the project, as well as a breakdown of the episodes, go to Episode 1.
Dell Boomi HTTP to CSV process
In Episode 3, we created a Dell Boomi process to extract data from USPTO. In Episode 4, we highlighted mapping from XML to CSV. Now, we combine these processes to connect to the HTTP file and translate it directly to CSV. The following sections outline this episode:
- Combine processes
- Copy First Project
- Add map from XML to CSV project
- Modify XML profile
- Format malformed XML
- Comment headers
- Add root tag
- Run process
- Modify Atom
Combining processes
Most components used in this episode are the same components from the first project. Begin by copying the first process. Then add in the map components from the second process. Adding the map components loads the data to CSV. These steps bring us one step closer to loading the data into the database.
To copy the previous projects:
1. In Component Explorer, click the blue triangle next to First Project and select Copy.
Figure 1. Use Component Explorer to copy previous projects.
2. In the Copy Component window, select the same folder used in the previous project (in this example, Dell-REP), and clear the Copy component dependents? check box.
Figure 2. Selecting the Dell-REP folder.
3. Click OK.
4. In Component Explorer, double-click the process named First Project 2.
5. Rename the process to HTTP to CSV.
6. Click Save.
Adding a map
To add a map:
1. Drag and drop a map shape to the newly created process.
2. Click the icon to view all components.
3. Select the XML to CSV map created in the previous process.
Figure 3. Selecting the XML to CSV map.
4. Click OK.
Modifying components
Modifying the filename
The process output is a CSV file. The first task is to modify the fileName shape.
1. Highlight the fileName shape and click Configure.
2. Change the name to fileName test.csv and then select the icon under the Parameters field.
3. Change the filename to test.csv and select OK.
4. Click OK.
Modifying the XML profile
Next, modify the map to parse all the patents in the file we downloaded from the site. For the Dell Boomi XML parser to extract data, the XML file must have a single parent element that encapsulates the data. To do this, add the element <root> to the file. After adding the <root> element to the file, modify the XML profile.
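The single-parent requirement is common to XML parsers in general, not just Dell Boomi. As a minimal sketch, the Python snippet below uses the standard library parser (an assumption for illustration; it is not the Dell Boomi parser) to show that two sibling us-patent-grant elements fail to parse on their own but parse once wrapped in <root>. The <id> child element is hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Two patent grants back to back with no enclosing element.
# The <id> children are made-up sample data, not real USPTO fields.
two_grants = (
    "<us-patent-grant><id>1</id></us-patent-grant>"
    "<us-patent-grant><id>2</id></us-patent-grant>"
)

# Without a single parent element, the parser rejects the text.
try:
    ET.fromstring(two_grants)
    parsed_without_root = True
except ET.ParseError:
    parsed_without_root = False

# Wrapping the same text in <root> gives the parser one top-level element.
wrapped = "<root>" + two_grants + "</root>"
root = ET.fromstring(wrapped)
grant_count = len(root.findall("us-patent-grant"))

print(parsed_without_root, grant_count)
```

This mirrors the profile change that follows: root occurs once, and us-patent-grant repeats underneath it.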
To modify the XML profile:
1. Go to Component Explorer and double-click the USPTO_XML profile.
Figure 4. The USPTO_XML profile.
Because a parent node cannot be added above the current top-level node, you must rearrange the existing nodes instead.
2. In the us-patent-grant element drop-down menu, select Add Child Element. This creates an element called element at the bottom of the profile.
Figure 5. Selecting the Add Child Element.
3. Select element and rename it to us-patent-grant.
4. Set Min Occurs to 1 and Max Occurs to unbounded. This change makes sure that you retrieve multiple patents.
Figure 6. Renaming the element.
5. Select the topmost us-patent-grant element and rename it root.
6. Set Min Occurs and Max Occurs to 1. The profile should now look like Figure 7.
Figure 7. The root element.
7. Move all elements except root under the newly created us-patent-grant element by dragging each element onto the us-patent-grant element. When the element is over the target, a plus sign appears.
Figure 8. After adding elements, a plus sign appears.
8. Drop the element. It is now added as a child element of us-patent-grant.
Figure 9. Dropping the element adds it as a child element under us-patent-grant.
9. Repeat this process for all remaining elements until the profile looks like Figure 10.
Figure 10. Final view of the us-patent-grant element.
10. Click Save and Close.
Formatting malformed XML
To deal with the malformed USPTO XML, add two shapes:
- One shape to deal with the multiple DTD headers
- One shape to add the <root> element
Before creating shapes, adjust the process to make room.
To adjust our process to make room:
1. Disconnect the arrow between the Unzip and fileName shapes.
2. Connect the arrow from XML to CSV to fileName.
Your process should look like Figure 11:
Figure 11. Making room by adjusting our process.
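To illustrate what the two shapes do to the document text, here is a minimal Python sketch of the same two fixes: commenting out each XML declaration and DOCTYPE header, then wrapping the result in <root>. It is only an approximation of the Dell Boomi steps. The sample input and its <doc-number> element are hypothetical, and Python's re module stands in for the Search/Replace step.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the unzipped USPTO file: two grants, each
# preceded by its own XML declaration and DOCTYPE line.
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>\n'
)
raw = (
    HEADER + "<us-patent-grant><doc-number>1</doc-number></us-patent-grant>\n"
    + HEADER + "<us-patent-grant><doc-number>2</doc-number></us-patent-grant>\n"
)

def comment_headers(text):
    """Apply two search/replace passes with escaped special characters."""
    text = re.sub(r"\<\?xml", "<!--", text)  # open a comment at each declaration
    return re.sub(r"\[ \]\>", "-->", text)   # close it at the end of each DOCTYPE

def add_root(text):
    """Mimic a message template of the form <root> {1} </root>."""
    return "<root> " + text + " </root>"

cleaned = add_root(comment_headers(raw))

# The cleaned text now parses as one document with repeated grants.
root = ET.fromstring(cleaned)
grants = root.findall("us-patent-grant")
print(root.tag, len(grants))
```

Each declaration and DOCTYPE pair ends up inside one comment, so the parser skips it and sees only the grant elements under <root>.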
Commenting headers
The next step is commenting out two tags. This is similar to how we manually removed them in the previous episode. This time, use XML-style comments <!--comment--> to remove them.
Original headers
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
Commented headers
<!-- xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" -->
To comment headers:
1. Highlight the Unzip shape and select Configure.
2. Change the name to Unzip/Comment so it reflects its new purpose.
3. To add a step, under Processing Step, select the (+) icon.
4. In the drop-down list, select Search/Replace, and then enter the following information:
Text To Find: \<\?xml
Replace With: <!--
*Note: Modify the string <?xml to \<\?xml because the underlying code running Dell Boomi needs special characters to be escaped before it can process the string. Similarly, in the next step the string [ ]> is modified to \[ \]\>.
5. Add a second step by selecting the (+) icon and selecting a Search/Replace process with the following text boxes:
Text To Find: \[ \]\>
Replace With: -->
Figure 12. Replacing text.
6. Click OK.
Adding a root element
Use a Message shape to add the root element to our document.
To add a root element:
1. Drag and drop a Message shape into the process and name it Add Root.
2. In the Message Properties window, under the Parameters field, click the (+) icon.
3. In the Parameter Value window, from the drop-down list, select Current Data.
4. Click OK.
5. In the Message Properties dialog box, in the Message text box, add the following:
<root> {1} </root>
Figure 13. Message Properties dialog box.
6. Click OK.
Figure 14. The completed process should look like this.
7. Click Save and Close.
Running the process
The addition of the Search/Replace and Message steps greatly affects the memory usage of the Atom. The Dell Boomi support team is looking for a solution.
In the meantime, our team increased the memory allocated to the Atom so that the process could complete. If you see a Java heap error, follow the steps in the Modifying Atom section.
Once you save and deploy your process, test.csv is stored in your local directory. The filename depends on whether you have run the previous process multiple times. The file should be approximately 338 KB and have multiple entries for each patent. For directions on how to deploy a process, refer to Boomi Getting Started.
When the process runs successfully, test.csv should look like the document snippet in Figure 15.
Figure 15. A successfully run test.csv.
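If you want a quick check of the output outside Dell Boomi, a small script can report the file size and row count. The sketch below is only an illustration: the function name summarize_csv is hypothetical, and it assumes a comma-delimited, UTF-8 encoded file, which may not match your output exactly.

```python
import csv
import os

def summarize_csv(path):
    """Return (size_in_bytes, row_count) for the CSV file at path."""
    size = os.path.getsize(path)
    # Assumes comma-delimited, UTF-8 text; adjust if your file differs.
    with open(path, newline="", encoding="utf-8") as handle:
        rows = sum(1 for _ in csv.reader(handle))
    return size, rows
```

For example, calling summarize_csv("test.csv") from the output directory returns the size in bytes, which you can compare against the expected size, along with the number of rows written.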
Modifying Atom
The following steps are found on the Dell Boomi Community page. You can find these steps by going to the Help & Feedback drop-down list in the upper right corner of the page and selecting Support & Community. This launches a separate User Community page where you can use the Search Community search bar. For this example, the instructions are modified to reflect the amount of memory used to make the process work correctly.
To increase the memory:
1. Stop the Atom.
2. Navigate to: \Boomi AtomSphere\<Atom name>\bin.
3. Using a text editor, such as Notepad, open the atom.vmoptions file.
4. Change -Xmx512m to -Xmx16384m.
5. Save the file and restart the Atom.
*Note: For this project, we are using the Dell Quickstart Data Warehouse Appliance, which has 96 GB to 128 GB of installed RAM. Allocating 16 GB of RAM to the Atom was not an issue for our team. If you are following these episodes on a system with total memory close to 16 GB, this procedure may cause issues.
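The edit in step 4 can also be expressed as a small text transformation. The following Python sketch is an illustration only: the function name set_max_heap is hypothetical, and the sample file content (including the -Xms128m line) is made up. Only the -Xmx512m and -Xmx16384m values come from the steps above.

```python
import re

def set_max_heap(vmoptions_text, megabytes):
    """Return the text with its -Xmx setting replaced by -Xmx<megabytes>m."""
    new_text, count = re.subn(
        r"-Xmx\d+[kKmMgG]?", f"-Xmx{megabytes}m", vmoptions_text
    )
    if count == 0:
        # Fail loudly rather than silently leaving the heap size unchanged.
        raise ValueError("no -Xmx option found")
    return new_text

# Made-up file content; a real atom.vmoptions file may hold other options.
sample = "-Xms128m\n-Xmx512m\n"
updated = set_max_heap(sample, 16384)
print(updated)
```

As with the manual steps, stop the Atom before changing the file and pick a value that fits the memory installed on your system.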