ABC Import Filters 4 - Developing simple filters

Developing Simple Import Filters

Creating an Import filter without using regular expressions

Introduction

This section will deal with a website which has a structured presentation of data providing the equivalent of tagged fields and yet does not present any real complications.

The main objective of preparing this particular filter is to introduce some of the import filter windows and their functionality.

As an aid to understanding, a simplified conceptual diagram of some of the different parts of Biblioscape filters is provided below.

Import filters conceptual diagram

Fig. 1

Although it would be possible to simplify the filter compilation process used with this first filter, the actual process followed will facilitate elaboration later in compiling filters of more difficult sites.

Building the Import Filter

Open the References module CTRL+ALT+R

In the Folders List select the Examples folder and create a child folder called Thesaurus.com

Open the BiblioBrowser CTRL+ALT+W

Open the website http://www.thesaurus.com, open the thesaurus tab and search for the word “Privacy”.

Capture the web page as a reference via the “Capture Page” button and the “As a Reference” selection. When the save as box appears, select “Web Page, HTML only” and then save the page into the C:\Documents and Settings\User Name\My Documents\Biblioscape Tutorial\attachments\References\HTML folder created earlier. Remember that this saves the reference to disk in HTML format, without the accompanying image files and other material, and at the same time moves the module focus to the References module.

Open the HTML web page stored on disk within the Biblioscape database in the default browser by selecting the paper clip icon displayed on the right of the new reference's title bar. Having the web page open in the default browser can be useful when working.

Should the reader require the BiblioBrowser to work off-line, thereby utilising the computer's stored temporary internet files, select the File|Work Offline menu item in Microsoft Internet Explorer. Remember to deselect it from within Internet Explorer when ready to go back on-line.

Returning to Biblioscape, open the newly imported reference and, in the Rich Text tab view, select all and copy to the clipboard (CTRL+A and then CTRL+C).

Open the import filters dialogue window - File|Import Filters or CTRL+Shift+M

Select the main filter “New” button under the left hand list box to create a new import filter.

When the “Define Import Filter Type” dialogue opens, paste the copied document page into the “Examples” field at the bottom of the dialogue and then name the filter “Thesaurus.com”.

The reason for pasting the reference document page, rather than the web page content, into the example box is to avoid later complications associated with differences between a web page display and its character content.

It is necessary to be very clear that all entries made in the “Define Import Filter Type” dialogue window will affect all/any sub-filters subsequently added to this filter.

In the “Based On” field, for the moment, type “Tutorial”, although this field is generally intended to contain the name of a filter which the current filter is derived from.

Complete the “Provider” field as appropriate.

In the “Database” field enter “Thesaurus”.

In the “Last Update” field type in the date today. Remember to use the format 20050425 (Year Month Day).
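Where filter details are also recorded outside Biblioscape, the same Year Month Day stamp can be produced programmatically. The following is a minimal Python sketch for illustration only; the function name is hypothetical and nothing in Biblioscape requires Python.

```python
from datetime import date


def last_update_stamp(day: date) -> str:
    """Format a date as an eight digit Year Month Day string."""
    return day.strftime("%Y%m%d")


# The example date used in this tutorial: 25 April 2005.
print(last_update_stamp(date(2005, 4, 25)))  # 20050425
```

Passing date.today() instead of a fixed date gives the stamp for the current day.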

Note the favourites tick box is greyed out but a tick is showing. Because the box is greyed out the filter has not been selected as a favourite. Choose it as a favourite immediately by selecting the greyed out box, which should then no longer be greyed out but should retain the tick. Select OK to close the “Define Import Filter Type” and OK again to also close the import filters dialogue window for the moment.

Take a few minutes to consider the newly captured reference and determine the database fields most appropriate for the data contained in that reference.

A brief indication of uses for the Biblioscape database fields may be found within the help index, listed under “All fields in a reference”. The fields chosen should be considered carefully against that list, because a badly chosen field may later mean adjusting both the filter and the references already created with it, or having to customise existing styles.

For the purposes of this initial exercise the following “tags” have been initially identified for the specified database fields.

Web Page Tag       Biblioscape Database Target Field
Main Entry:        Title
Part of Speech:    Description
Definition:        Keywords & Document
Synonyms:          Document
Antonyms:          Document
Source:            Secondary Title
Copyright·©        Miscellaneous

Now identify the features which separate each individual record. With this web page that is simple: the field names are repeated on the web page for each record, so the first or last field may be used.

Open the import filters dialogue window SHIFT+CTRL+M.

Select the Favorites button to reduce the list size.

Select the “Thesaurus.com” filter.

Select the main filter edit button to open the “Define Import Filter Type” dialogue window.

Select the “Record” tab.

Define Import Filter Dialogue – Record tab

Import filter record tab

Take a few moments to review the “Record” tab.

The available selections within this tab are:-

The “Blank Line” option. Every blank line within the displayed page will be treated as indicating the start of a new record. Caution should be exercised in the use of this.

The “Sep. Text” option. This is used to identify separating text located at the beginning of any new record.

The “First Tag” option. Used to indicate that the identified first tag should be treated as the record separator. The term “tag” is used to refer to any set of characters used to identify a record's field, e.g. in the example at www.thesaurus.com “Main Entry: ” is used to identify or “tag” the word to which the definitions provided relate.

The “Last Tag” option indicates the last tag of a record should be treated as the record separator. When this is used the “Reference ID” number has a tendency to increment by two for each reference imported, with one number not being used. What appears to happen is that the record ID counter increments for the new reference and then increments again when the last tag is encountered.

For the purposes of this exercise the “Record” tab “What separates each record” fields of the import filter will contain characters only, as no use of regular expressions or other criteria is being made at this stage. An important point to remember is that the characters entered in any of the fields in this tab must be at the beginning of a new line in the web page being imported (or parsed).

Recall: The contents of this tab must be applicable for each and every sub-filter which may be associated with this filter.

Select “First Tag” and in the text box input “Main Entry: ” (there is a trailing space after the colon). Because of the database rules applicable to the target field, Biblioscape can deal with the filter correctly if the trailing space is missing from the tag, but for completeness and later clarity it is best to include it, so that no unexpected space is left within the document being parsed.
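The effect of the “First Tag” setting can be pictured with a short Python sketch. It illustrates the rule described above and is not Biblioscape's actual implementation; the function name and the sample text are hypothetical. A new record starts at every line that begins with the tag, and a tag appearing part-way through a line is ignored.

```python
def split_records(page_text, first_tag="Main Entry: "):
    """Split text into records; each line starting with first_tag opens a new record."""
    records = []
    for line in page_text.splitlines():
        if line.startswith(first_tag):
            records.append([line])      # tag at the beginning of a line: new record
        elif records:
            records[-1].append(line)    # any other line belongs to the current record
        # Lines before the first tag belong to no record and are skipped.
    return ["\n".join(lines) for lines in records]


# Hypothetical sample text, loosely modelled on the captured page.
sample = (
    "Search results\n"
    "Main Entry: privacy\n"
    "Part of Speech: noun\n"
    "Main Entry: solitude\n"
    "Part of Speech: noun\n"
)

for record in split_records(sample):
    print(repr(record))
```

With this sample the sketch yields two records, one for “privacy” and one for “solitude”; the “Search results” line is discarded.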

Because the required text identified by the “Main Entry: “ tag within the web page is in lower case, and it is required in title case in the Biblioscape database, select the “Replace and Remove” tab.

Define Import Filter Dialogue – Replace and Remove tab

Changes list box

Take a few moments to review the “Replace or Remove” tab.

The “Limit changes to “Tag” or “Field”” box can either be typed into directly or the selected item visible in the “Available “Field”” list can be transferred across to correct an existing entry with a Biblioscape database field name.

The functional differences between “Tag” and “Field” changes are:-

  • “Tag” changes are conducted prior to insertion into any Biblioscape database field and hence will affect every field that particular “Tag” data is placed in;
  • “Field” changes only apply to the identified database field within the imported record.

See fig. 1 for a graphical illustration of this difference.

Both “Tag” and “Field” will be used during this exercise to illustrate the differences.
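The difference between the two kinds of change can also be expressed as a small Python model. This is a conceptual sketch only, with hypothetical function and variable names, and makes no claim about how Biblioscape is implemented: a “Tag” change is applied once to the tagged data before it is copied anywhere, whereas a “Field” change is applied to a single database field afterwards.

```python
def import_record(tag_data, tag_to_fields, tag_changes, field_changes):
    """Distribute tagged data into fields, applying tag then field changes."""
    record = {}
    for tag, text in tag_data.items():
        # "Tag" change: applied before the data reaches any database field.
        if tag in tag_changes:
            text = tag_changes[tag](text)
        for field in tag_to_fields.get(tag, []):
            record[field] = text
    # "Field" change: applied only to the named database field.
    for field, change in field_changes.items():
        if field in record:
            record[field] = change(record[field])
    return record


tag_data = {"Main Entry: ": "privacy"}
tag_to_fields = {"Main Entry: ": ["Title", "Subject"]}

by_tag = import_record(tag_data, tag_to_fields, {"Main Entry: ": str.title}, {})
by_field = import_record(tag_data, tag_to_fields, {}, {"Title": str.title})

print(by_tag)    # {'Title': 'Privacy', 'Subject': 'Privacy'}
print(by_field)  # {'Title': 'Privacy', 'Subject': 'privacy'}
```

The tag-level change alters both fields fed by “Main Entry: ”, while the field-level change alters “Title” alone.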

Preferably copy the “Main Entry: ” field identifier from the example display and paste it into the “Limit changes to “Tag” or “Field”” box; alternatively, type that text into the box. Once the entry is complete, select the small + button to the right of the field and the text will appear in the “Changes List Box”. Now select “Title case” from the “Change case to:” drop down list and then select the tick to the right of the “Limit changes to “Tag” or “Field”” box.

Note. It would seem to be good practice to select the tick button after every change to the “Find what:”, “Replace with:” or “Change case to:” fields to ensure the alterations are saved. Neglecting to select the tick will often cause any changes made to those fields to be lost.

The relationships between the “Limit changes to “Tag” or “Field””, the “Available “Field”” and “Changes List Box” are important and need to be clearly understood, so before going further some time will be spent becoming familiar with the functionality. Expect to make mistakes in this area during the import filters learning process.

The only item currently in the “Changes List Box” is “Main Entry: ” with the “Change case to:” field showing “Title case”.

To create another entry in the “Changes List Box”:-

  1. In the “Limit changes to “Tag” or “Field”” box type the database field title “Notes” and select the + button. “Notes” should now appear in the “Changes List Box”.
  2. In the “Replace with” text box for the “Notes” entry type “http://thesaurus.com”, alter the “Change case to:” box to the empty entry and select the tick to the right of the “Limit changes to “Tag” or “Field”” text box to accept the changes.
  3. Now from the “Available “Field”” drop down list select “URL” and select the <- button to the left of that list box.

Notice that the field name “Notes” in the “Changes List Box” has changed to “URL” whilst the “Replace with” field has remained the same. A change of field name carried out in this way does not require the tick button to be selected, as it is committed when the <- button is selected.

Remember this functionality when using the “Available “Field”” box to change an existing entry in the “Changes List Box”. It can be useful when unsure of a field name: when making an initial entry, type “a” or “1” and select the + button, then use the “Available “Field”” box to select the correct field name.

Although the “File as” field is available in the “Available “Field”” drop down list, it appears not to function if utilised within an import filter.

Because the URL of the site is not contained within the record, and it is not possible to capture it from that web page, the “Replace with:” field is being used to insert the URL into the record. If necessary, “Replace with” can be used to populate an otherwise empty field, provided that field content will be consistent across all records imported using that filter. An ability to choose to populate the URL field from the BiblioBrowser URL box, in addition to a tag field, would be a useful enhancement.

To create a further item in the “Changes List Box”:-

  1. In the “Changes List Box” select the “Main Entry: ” item and then select the + button. A new item for “Main Entry: ” will appear at the bottom of the list and that item will be selected. Note that this item duplicates the original one in respect of the changes to be made.
  2. Open the “Available “Field”” drop down list, scroll to “Description” and select it, then press the <- button. This will change the selected “Main Entry: ” tag item to a “Description” database field item, retaining the “Change case to” entry. No tick is required because the <- button was used, although during the early stages of using this dialogue it is advisable to select the tick button every time. This item will be required in the filter so it will be retained.

To remove an item from the “Changes List Box” the “-” (minus) button is used.

Further experience working with this tab will be gained later.

The other two tabs in the “Define Import Filter Dialogue” window “Date and Others” and “Authors and Keywords” will have the default options “Smart parsing” selected. That is what is required for this filter so they will not be touched at the moment.

Select OK to close the “Define Import Filter Dialogue” window.

Recall the earlier conceptual diagram. Some of the main filter items are completed, but the filter will not yet work on the record to be imported as no reference type has yet been identified. The association of specific Reference Types is achieved by the sub-filters.

Adding a Sub-Filter

With the “Thesaurus.com” filter selected in the Main Import Filters window, go to the “Sub-Filters List” and select the blank item directly under the “Reference Type” column header at the top of the list. A drop down window will appear listing the different reference types existing within the database; select “Electronic Source”. Once that appears in the sub-filter list, also select “Electronic Source” in the “Default Reference Type” list. The “Default Reference Type” list only contains entries for those reference types selected as sub-filters, so “Electronic Source” should be the only entry in that list.

With “Thesaurus.com” selected in the Main Import Filters List and “Electronic Source” selected in the “Sub-Filters List” select the sub-filters edit button (the second button from the right hand side) to open the “Sub-Filter Dialogue” window at the “Match Fields” tab as illustrated.

Sub-Filter Dialogue Window

Sub-filter match fields tab

The “Sub-Filter Dialogue” is used to compile the filter criteria for the specified reference type(s). A very simple rule is that the filter detail entered in the “Define Import Filter Dialogue” is actioned first, followed by the detail contained within the relevant “Sub-Filter Dialogue” as determined by the sub-filter matching data shown in the “Matching Text” box.

This particular filter is a relatively simple one so leave the “Reference type” and “Matching text” fields blank. They will be further explained and used in other examples.

The “Data fields used to match Tag fields to Data fields” is the area now of interest. Tag data entered into these fields will determine which Biblioscape database field the information is entered into. The tag or text entered within that field will not appear within the database record field. In text only tag entries the data entered into database fields commences immediately after the tag which is entered, as illustrated in the following example.

Import text matching graphic

Recall that in the Define Import Filter Dialogue Record tab the text “Main Entry: ” was selected as identifying the start of each record and placed in the first tag text box. Look back to the database fields selected as suitable for the tagged fields: “Main Entry: ” was chosen to be associated with “Title”, so enter that tag into the “Title” field within the “Data fields used to match Tag fields to Data fields” area.

Recall that the tags will appear within the examples text box and so may be copied and pasted to ensure they are correct. Because this is a relatively simple filter with no use made of complex fields it is possible to map the tags to multiple database fields without any complication, thereby maximising the utility of the tag data.

Tags chosen for these fields must be at the beginning of a line.

Take careful note that if only a single tag is used in a single sub-filter field, and that tag is also identified in the Main Filter dialogue window Record tab “First Tag” field, and nothing else is entered anywhere else within the filter, the filter will not work and will report:

“0 records are imported. Please make sure the file is in a tagged format. If so please check if the correct import filter was used.”

If a tag needs to be tested at this early point during filter creation, making an entry in the Replace or Remove tab will overcome this problem.

Complete the fields as follows:

Tag Field Name          Tag (each with a trailing space)
Title                   Main Entry:
Producer (Publisher)    Source:
Subject                 Main Entry:
Keywords                Definition:
Document                Definition:
Description             Part of Speech:
Miscellaneous           Copyright·©

Note that in the Document field, only the “Definition: ” tag has been entered. This is possible because the “Synonyms: ” and “Antonyms: ” tags are not otherwise used and follow immediately after “Definition: ” and before any other tag. Additionally those tag words will be required within the document field.
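The way untagged lines fall into the field of the preceding tag can be sketched in Python. The sketch is an illustration under stated assumptions rather than a description of Biblioscape's internals: the mapping is deliberately reduced, and the function name, the sample record and the “Example Thesaurus” source are hypothetical.

```python
# Reduced, illustrative mapping of tags to database fields (not the full filter).
TAG_TO_FIELDS = {
    "Main Entry: ": ["Title", "Subject"],
    "Definition: ": ["Document"],
    "Source: ": ["Producer (Publisher)"],
}


def map_record(record_text, tag_to_fields=TAG_TO_FIELDS):
    """Map one record's lines onto database fields using line-initial tags."""
    fields = {}
    current_targets = []  # fields that receive any following unmapped lines
    for line in record_text.splitlines():
        for tag, targets in tag_to_fields.items():
            if line.startswith(tag):
                current_targets = targets
                for name in targets:
                    # The data begins immediately after the tag; the tag is dropped.
                    fields[name] = line[len(tag):]
                break
        else:
            # No mapped tag on this line: it continues the previous tag's data.
            for name in current_targets:
                fields[name] += "\n" + line
    return fields


sample_record = (
    "Main Entry: privacy\n"
    "Definition: seclusion\n"
    "Synonyms: isolation, retreat\n"
    "Antonyms: publicity\n"
    "Source: Example Thesaurus"
)

result = map_record(sample_record)
print(result["Document"])
```

Because “Synonyms: ” and “Antonyms: ” are not mapped, those lines, tag words included, are carried into the Document field after the definition, while the “Definition: ” tag text itself is not imported.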

The filter is essentially complete and will now work, but does need some tidying up if the database references are to be correctly formatted. It would be possible to do that immediately (e.g. “Copyright·©” needs inserting at the beginning of the Miscellaneous field) but to simplify that task an example reference would help.

Select OK to exit the dialogue

Once that is done select OK to exit the Main Import Filters dialogue window.

For the purpose of this first filter the associated Internet Resource is being created after the filter; with the later filter examples the associated Internet Resource will be created first.

To link the filter to the web site move to the BiblioBrowser and select “Resources”, “Organize” and then “New”. Create a Resource as previously described with the details:-

Title - Thesaurus.com

URL - http://thesaurus.com

Subject - Etymology

Access - Free

and then associate the new import filter with it by selecting Thesaurus.com from the “Import Filter” drop down list.

Notice that a new Internet Resource group has been created called Etymology.

Now select the new resource, reloading the web page, and search for “privacy” again.

With “http://thesaurus.com” and the privacy search web page visible, select “Capture Reference” to import the references into the Tutorials Examples folder. Eight new references should be successfully imported.

Press CTRL+ALT+R to move back to References. Open the reference with the title “Privacy” and then switch to the “All Fields” view.

Open the Import Filters Dialogue Window CTRL+Shift+M. Select the “Favourites” button to reduce the list and select the “Thesaurus.com” filter and then the main filter edit button.

Select the “Replace or Remove” tab.

Looking at the record, notice that the Miscellaneous field commences with the year and is missing “Copyright © ”, because that text was used as a tag. Copy the “Copyright © ” text (including the trailing space) from the examples box. Create an entry for Miscellaneous in the “Changes List Box”, enter ^ at the beginning of the “Replace with” field and paste “Copyright·© ” after it.

The caret “^” indicates the replacement characters should be placed at the beginning of the database field. If no ^ is used any replacement is entered at the end of the field, unless existing character(s) positioned at some other point within the field is/are the target for replacement.
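The caret rule can be summarised in a few lines of Python. This is a minimal sketch that assumes the “Find what” field is empty, so it covers only the prepend and append cases; the function name is hypothetical and the code is not taken from Biblioscape.

```python
def apply_replace_with(field_value, replace_with):
    """Apply a "Replace with" entry when "Find what" is empty (assumed here)."""
    if replace_with.startswith("^"):
        # A leading caret: the text goes at the beginning of the field.
        return replace_with[1:] + field_value
    # No caret: the text goes at the end of the field.
    return field_value + replace_with


print(apply_replace_with("2005 by Lexico Publishing Group, LLC.", "^Copyright © "))
print(apply_replace_with("All rights", " reserved."))
```

The first call places “Copyright © ” in front of the year; the second appends “ reserved.” to the end of the value.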

Notice in the reference record that the Keyword is not in title case. A title case replacement entry already exists, so select the “Main Entry: ” item and then the + button. Change the new item's name to “Keywords” by selecting “Keywords” from the “Available “Field”” drop down list and selecting the <- button. Select the OK button twice to close the filters dialogues.

Change to the reference Document view. Notice that the initial wording does not necessarily make sense because the “Definition:” tag text is not imported.

Open the Import Filters Dialogue Window CTRL+Shift+M, select the “Favourites” button to reduce the list, select the “Thesaurus.com” filter, the main filter edit button and then the Replace or Remove tab again.

“Miscellaneous” contains a replace and remove at the beginning of a field, so select that and then the + button. Change “Miscellaneous” to “Document” using the “Available “Field”” drop down list and selecting the <- button. In the “Replace with” field, leaving the ^ caret as it is, change the rest of the text to “Definition: ” (with a trailing space) and select the tick button.

No other major problems exist with that reference so OK twice and change to the next reference in the list. Check each reference created by the import in turn.

Notice on the last reference record (Solitude) that the “Miscellaneous” field contains additional unwanted data.

The tagged data within the Miscellaneous data field can be perceived to be:-

Copyright © 2005 by Lexico Publishing Group, LLC. All rights reserved.

ADVERTISEMENT

Try your search for "privacy" at: Amazon.com - Shop for books, music and more Dictionary.com - Search for definitions HighBeam Research - 32 million documents from leading publications Reference.com - Web Search powered by Google

ADVERTISEMENT ; 2005, Lexico Publishing Group, LLC. All rights reserved.

All of that data appears because there is no tag following the tag for the Miscellaneous field (i.e. “Copyright·© ”). However, the data actually required is all contained on the first line, so it is a simple matter to remove the rest.

Recall from the diagram that complex field definitions work within the information to be held in the database field itself. Because of this, any part of the field content can be used as a complex field tag: unlike the other tags, it is not restricted to the beginning of a line.

To ensure the thesaurus.com filter removes that extra data on every import from the site, a sub-filter complex field filter will be used and, to simplify understanding at a later date, the filter criteria will consist of “reserved.”. If there had been only one full stop within the data, that alone could have been used; equally, just “d.” could be used. This particular part of the tagged data was chosen because it seems unlikely to change very often.

Open the Import Filters Dialogue Window, select the “Favourites” button to reduce the list, select the “Thesaurus.com” filter, the “Electronic Source” sub-filter, the edit button and then the “Miscellaneous” tag field in the “Data fields used to map Tag fields to Data fields” list. Having done that select the “Complex Fields” tab as illustrated below.

From the “Available Fields” list select “Miscellaneous”, then the “Insert” button and then the “Edit” button.

In the “ID text after selected field” box type “reserved.” and then select the OK button.

Because the filter field used states that “reserved.” appears after the selected database field, that word will not be imported; it is, however, needed. So, select the main filter edit button and then the “Replace or Remove” tab.

Create another entry in the “Changes List box” for Miscellaneous and type “ reserved.” (with a leading space) in the “Replace with” field. (As there is no caret or entry in the text to be replaced field “ reserved.” will be placed at the end of the database field.) Select the tick button and OK out of the filters dialogues back to the main reference windows.
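The combined effect of the complex field and the final replacement can be modelled in Python. The sketch is an illustration only and assumes, without confirmation from Biblioscape, that trailing white space is trimmed once the field has been cut; the function name is hypothetical and the sample text is abbreviated from the Miscellaneous field shown earlier.

```python
def keep_before(field_value, id_text_after):
    """Keep only the text before the first occurrence of id_text_after."""
    # partition() returns the whole string as the head when the text is absent.
    head = field_value.partition(id_text_after)[0]
    return head.rstrip()  # assumption: trailing white space is trimmed


# Abbreviated sample of the unwanted Miscellaneous content.
raw = (
    "2005 by Lexico Publishing Group, LLC. All rights reserved.\n"
    "ADVERTISEMENT\n"
    "Try your search for \"privacy\" at: Amazon.com"
)

trimmed = keep_before(raw, "reserved.")
restored = trimmed + " reserved."  # the "Replace with" entry, added at the end
print(restored)
```

Only the first occurrence of “reserved.” matters, so the advertisement text and anything after it, including any later repetition of the word, is discarded before the word is added back.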

One final test. From the BiblioBrowser search for another word and capture references for that also. Change back to the References module and check any new references in turn for formatting problems or import filter errors.

Well done! You have created a Biblioscape import filter.

Export the new filter and save it in the C:\Documents and Settings\User Name\My Documents\Biblioscape Tutorial\attachments\Import Filters folder. Create a reference and add the exported filter as an attachment to make it easy to find in the future.