ABC Import Filters 6 - Creating Simple Import Filters using Regular Expressions

Introduction

This import filter builds upon the knowledge acquired in ABCs 4 and 5 and begins to indicate some of the potential of import filters. Once more, to minimise complexity, a single sub-filter will be used. Two additional programs will also be used to assist in understanding regular expressions and how they function within Biblioscape import filters. The descriptions of some basic actions will begin to be reduced to the outcome required. Finally, a simple Biblioscape style is adapted to include URLs.

The Biblioscape help files state that the version of regular expressions used is from regexpstudio.com. A test program able to run particular filter regular expressions, preferably one which illustrates the results, will be of assistance. A variety of programs are available on the WWW for this purpose. Among the available freeware ones are:-

· “Test regular expressions”, a freeware program available for download at the regexpstudio website - http://www.regexpstudio.com/RegExpStudio.html

The program contains some example expressions and provides a test environment. The help files contain syntax assistance similar to that provided within the Biblioscape help files. The test environment output relies upon user input to find the individual text item(s) required; otherwise there is little interactivity with the user.

(An explanation of how to install and use this will be provided later)

· “The Regulator”, a freeware program available at http://tools.osherove.com/CoolTools/TheRegulator/tabid/185/Default.aspx

The Regulator is intended for use with the .NET framework and requires .NET framework 1.1.4322 to be installed. Some syntax differences will also exist.

Although the user interface for The Regulator may initially look complex to a new user unfamiliar with a development environment, it does contain good help files, and while a regular expression is being compiled useful pop-ups appear containing a list of valid meta characters. The test environment provides a comprehensive, interactive view of the regular expression output. Access to an on-line library of regular expressions is also available via this application.

A hexadecimal viewer/editor will also be of assistance during the next stages.

It is feasible to view/edit files in hexadecimal with a number of applications. If you have a favourite one use that or download the freeware version (version 2.0) of HexEditor from http://www.expertcomsoft.com/download.htm; this has a user friendly interface and contains a useful help file containing many tips and explanations, which for new users will possibly compensate for the opening nag screen. (HexEditor 2 free version will be used in later explanations.)

Regular Expressions

Vernon W. Hui has described regular expressions as:-

“Regular expressions provide tools for developing complex pattern matching and textual search-and-replace algorithms. Ask any Perl, egrep, awk or sed developer, and they’ll tell you that regular expressions are one of the most powerful utilities available for manipulating text and data. By creating patterns to match specific strings, a developer has total control over searching, extracting, or replacing data. In short, to master regular expressions is to master your data.”

In much the same way that SQL provides users with extensive abilities with databases, adding regular expressions to Biblioscape has given the user great control over the import and find-and-replace operations associated with text. A high degree of control can come at some cost, in this case learning regular expressions thoroughly, but any novice (like me) can quickly gain sufficient knowledge to reap a level of benefit.

As stated earlier this document is not intended to deliver a regular expressions tutorial, more of a taster to furnish the reader with sufficient skill and knowledge to provide a foundation in regular expressions facilitating the compilation of effective import filters and potentially stimulating further learning.

The final stages of the previous filter in this ABC series identified to the reader that Biblioscape recognises the declaration of regular expressions by the use of RE( and )RE.

Regular Expression parsing/normal Biblioscape tag parsing

Some important points to remember are:-

1. regular expressions are case sensitive;

2. a Biblioscape import filter sends each text line separately to the regular expression parsing engine, as explained in the note below, taken from the Biblioscape regular expression help file item “Meta characters – Line separators”:

“Note: When regular expressions are used in a Biblioscape import filter, the line separator doesn’t apply because when Biblioscape finds "RE(...)RE" it reads one line at a time and sends it to the regular expression engine for processing. When used in "Edit | Find" or "Edit | Replace", the line separator does apply when working against memo fields like Notes, Abstract, Keywords, Miscellaneous.”
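The two points above can be sketched in Python. This is an illustration only: Python’s re module stands in for the engine Biblioscape uses (the syntax is not identical), and the sample text and the function name match_per_line are hypothetical.

```python
import re

# Hypothetical two-line sample; it is not taken from the real Roget's page.
SAMPLE = "528. Concealment.\r\nsee also 530; 666."


def match_per_line(pattern, text):
    """Search each line on its own, as the note describes for import filters."""
    compiled = re.compile(pattern)
    hits = []
    for line in text.splitlines():
        found = compiled.search(line)
        hits.append(found.group(0) if found else None)
    return hits


# "^" anchors at the start of every line because each line is searched separately.
per_line = match_per_line(r"^\d\d\d\.", SAMPLE)

# A pattern that needs text from two lines cannot match one line at a time.
across_lines = match_per_line(r"Concealment\.\r\nsee", SAMPLE)

# Matching is case sensitive: a lower-case "c" does not find "Concealment".
case_hits = match_per_line(r"concealment", SAMPLE)

print(per_line, across_lines, case_hits)
```

Only the first line yields a match for the anchored pattern, and neither the two-line pattern nor the lower-case pattern matches anything.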


As stated in that note the line separator action is not a strict rule, being affected by rules applicable to the relevant database field(s) and the content of the tag field. These relationships within the parsing mechanism do make the use of line separators in regular expressions more complex. This filter provides one example of when line separators can successfully be used by utilising database rules as a facilitator.


If a regular expression tag is not used the above rule does not apply but the database rules do.

The Biblioscape help files item “Define an Import File” states –

“Multiple Lines: If the text of a tagged field takes more than one line, Biblioscape will combine all the lines according to the following rule:
If the tagged field is “Authors” or “Keywords”, the lines will be first trimmed and joined by “; ”.
For other fields, the lines will be first trimmed and joined by “ ” .” (a space)
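The quoted joining rule can be sketched in a few lines of Python. This is a minimal sketch of the rule as worded in the help file, not Biblioscape’s own code; the function name join_field_lines is hypothetical, and how Biblioscape treats empty lines is not stated, so they are simply kept here.

```python
def join_field_lines(field_name, lines):
    """Trim each line, then join according to the quoted help-file rule."""
    trimmed = [line.strip() for line in lines]
    # "Authors" and "Keywords" are joined by "; ", every other field by a space.
    separator = "; " if field_name in ("Authors", "Keywords") else " "
    return separator.join(trimmed)


keywords = join_field_lines("Keywords", ["  privacy ", "concealment  "])
title = join_field_lines("Title", ["Roget's  ", "  Thesaurus"])

print(keywords)
print(title)
```

The same two-line input therefore produces a different result depending on which database field the tag maps to, which is why the field choice matters when compiling a filter.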

Recall from the earlier ABC’s that the database field rules can be as important when compiling an import filter as any parsing tag. The differing approaches/rules can be used to advantage during any parsing process, although they can also confuse the unwary user. Various examples will be provided to illustrate these points.

To effectively utilise regular expressions some understanding of regular expression syntax is required.

Examples of the forms of syntax are documented within the Test Regular Expressions program help files, or the working examples available from the various toolbar buttons provided in the program itself. Because some regular expression meta characters can have different meanings assigned to them depending on the precise circumstances of their application a reasonable level of familiarity or some form of quick reference becomes important.

Many regular expression meta characters, modifiers and escape sequences are listed within the document ABC Import Filters 6A.

The Biblioscape help files provide examples and also contain information about regular expression syntax and meta characters. For the sake of ease and clarity within this first venture into using regular expressions, a consistently simplified approach to the syntax will be facilitated by the extensive use of hexadecimal characters.

Preparation


Configure Biblioscape views

CTRL+ALT+R to open the references module.

To ensure views which can assist in the import filter compilation are showing the

· View|Preview Pane|Preview Header;

view should be selected and visible. The Formatted Preview window may be resized in the normal way, as/if necessary, by grabbing the line between that window and the preview pane window.

Building the Import Filter

Open the Biblioscape Tutorial file and from within the Internet module open the Roget’s Thesaurus website from the Biblioscape Internet Resource link created earlier, then conduct a full text search for the word “Privacy”.

Once the search has been returned select the “Capture Page” button and the “As a Reference” menu item. When the save dialogue opens save as a “Web Page, HTML only” into the C:\Documents and Settings\User Name\My Documents\Biblioscape Tutorial\attachments\References\HTML folder.

Biblioscape should now be in the references module displaying a view of the reference just saved. Select the Save reference toolbar icon and then with the new Roget’s reference on view select the paper clip icon to open the web page saved on hard disk as an attachment in the default browser.

Now consider the newly captured reference and determine:-

1. What type of reference it will become;
2. Which database fields the data contained in the reference should populate;
3. Which data content initially appears useful as tags.

If it proves an easy way of determining some of the answers, create a duplicate of the import and use cut and paste to populate the fields, but again leave the original reference unchanged. (Recall that the create duplicate icon, which becomes visible on the toolbar when an individual reference is open, can be used, although the document field will need to be cut and pasted across.)

As before it may be advantageous to create a printout of the imported reference document so notes can be made. Use coloured marker pens to denote intended tags or fields, or start making a rough list of the requirements.

The following tags at the beginning of lines appear to be available within the Roget’s privacy search record:-

Having completed the initial rough tag identification process, return to Biblioscape and in the Rich Text/Document view of the unedited captured reference select all and copy to clipboard. (CTRL+A, CTRL+C).

Open the import filters dialogue window - File|Import Filters.

Select the “New” button under the left hand list box to create the new filter.

When the “Define Import Filter Type” dialogue opens paste the clipboard entry containing the document page into the “Examples” field at the bottom of the dialogue and then name the filter “Roget’s Thesaurus”.

In the “Based On” field, type “Tutorial”

Complete the “Provider” field.

In the “Comments” field enter “Imports single record only from search response.”

Enter the current date in the “Last Update” field. i.e. 20050425 (Year Month Day)

Choose the new filter as a favourite immediately by selecting the greyed out box to clear it whilst retaining the tick.

Leaving “Blank Line” selected in the “Record” tab OK out of that dialogue.

With “Roget’s Thesaurus” selected in the Main Import Filters List, in the “Sub-Filters List” select “Book Section” from the “Reference Type” drop down list and make “Book Section” the “Default Reference Type” for this filter.

OK out of the filters dialogue for the moment.

Using the HexEditor

Open the chosen hexadecimal program, create a New blank document and paste the captured reference document into that blank document.

If Hex Editor is used and has not been configured, answer any requestor dialogues necessary to achieve the pasting action. Do not drag and drop the file C:\Program Files\Biblioscape 6\Temp\bibWebCapture.txt into Hex Editor. Although no requestor dialogues then appear, the file is locked, with the result that the Biblioscape import filters will not function.

The Hex Editor display will end up looking something like:-

To quickly identify a hexadecimal value without opening any dialogues select the character(s) within the ASCII text area and the corresponding character(s) will be indicated in the hexadecimal area. So selecting “1911” near the beginning of the document will show that “31 39 31 31” are the corresponding hexadecimal characters. Take note that the Roget’s document uses the normal “0D 0A” as a return/new line and how those particular characters are by default displayed in green making them simpler to identify.
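The same character-to-hexadecimal lookup can be checked without a hex editor. The following is a small Python sketch, assuming the text is plain ASCII; the function name to_hex is hypothetical.

```python
def to_hex(text):
    """Return the space-separated hexadecimal byte values of ASCII text."""
    return text.encode("ascii").hex(" ").upper()


year_hex = to_hex("1911")       # the four digits near the start of the document
newline_hex = to_hex("\r\n")    # carriage return followed by new line
full_stop_hex = to_hex(".")

print(year_hex)
print(newline_hex)
print(full_stop_hex)
```

This prints 31 39 31 31, then 0D 0A, then 2E, matching the values visible in the hexadecimal area.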

Using the TRegExpr test program to test filter expressions.


Extract the downloaded TRegExpr files. A good location to extract the files to would be C:\Program Files\Biblioscape 6\Tools\Test Regular Expressions\. Now create a new shortcut for the TestRExp.exe file, placing the new shortcut in a new folder named “Test Regular Expressions” in the Biblioscape start menu folder. If need be, refer back to the earlier item on creating shortcuts and folders within the Biblioscape Interface Preparation section. Assuming a normal single personal computer install of Windows XP, the user’s Start Menu is located at C:\Documents and Settings\User name\Start Menu. If the Biblioscape start menu folder does not appear there, check in the All Users\Start Menu folder.

Open the Test Regular Expressions program and paste the reference document copy (the same entry which appears in the filter example tab and HexEditor) into the “Input string” text box. (The text box currently contains “My e-mails is anso@mail.ru and anso@usa.net” when the “Expression” tab is selected. Do not worry about deleting the example, as it will be regenerated when the TRegExpr program is opened again.)

Within the Test Regular Expressions program it is not necessary to use the RE regular expressions declarations utilised by Biblioscape, but it will help later if the start and end brackets are used.

The objective of this particular simple regular expression filter will be to capture three references, one for each of the main definitions. To achieve that the numbers “528.” “666.” and “893.” will be used as record tags. But first some simple illustrations of regular expression syntax.

The regular expression meta character for any single digit is “\d”.

In the “Regular Expression” text box of the Test Regular Expressions program, delete the default entry and type “(\d\d\d.)”, then select the “Exec” button on the bottom left. “1911” is selected. Select the “ExecNext” button and “528.” is selected; select “ExecNext” again and “530;” is selected. The full stop used in the regular expression matches anything at all, because the regular expression meta character for any character is a full stop “.”. So a regular expression of “\d\d\d.” matches any three digits followed by any other character. For a regular expression to treat a meta character as if it were a normal ASCII character, a “\” is placed before it. So “\.” means a full stop.

In the expression insert a “\” before the full stop, giving “(\d\d\d\.)”. Execute the expression again (“Exec” button) and then select “ExecNext”. The correct start number is initially identified, but other numbers are then found which do not indicate the start of any record, because the expression does not include the start of a line as a match requirement.
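The difference between the unescaped and escaped full stop can be reproduced in Python. This is a sketch under assumptions: Python’s re module is used in place of the TRegExpr test program, and SAMPLE is a short hypothetical text containing the same kinds of number as the captured page, not the page itself.

```python
import re

# Hypothetical stand-in for the captured page.
SAMPLE = "Roget's Thesaurus, 1911\r\n528. Concealment.\r\nsee 530; 666.\r\n"

# Three digits followed by ANY character: also catches "1911" and "530;".
any_char = re.findall(r"(\d\d\d.)", SAMPLE)

# Three digits followed by a literal full stop: "1911" and "530;" drop out,
# but "666." in the middle of a line is still found.
literal_stop = re.findall(r"(\d\d\d\.)", SAMPLE)

print(any_char)
print(literal_stop)
```

In this sample the escaped version still returns a number from the middle of a line, which is the remaining problem described above.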

Recalling from previous filters that “^” is a meta character for the start of a line, insert “^” immediately after the opening parenthesis of the regular expression, giving “(^\d\d\d\.)”, and test it again. Nothing should be found. The reason for this is that the “m” modifier is not selected in the TRegExpr test program. Select that modifier now so that a tick is displayed, deselect all the other modifiers, then Exec and ExecNext once more. This time only the three required numbers should be found within the example document.
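The effect of the “m” modifier has a close counterpart in Python, where it is the re.MULTILINE flag. The sketch below is an analogy rather than a description of TRegExpr itself, and SAMPLE is again a hypothetical text with invented headings.

```python
import re

# Hypothetical stand-in for the captured page: two record starts plus
# numbers that appear in the middle of a line.
SAMPLE = (
    "Roget's Thesaurus, 1911\r\n"
    "528. Concealment.\r\n"
    "see 530; 666.\r\n"
    "666. Example heading.\r\n"
)

# Without the multi-line flag "^" only matches at the very start of the text.
without_m = re.findall(r"(^\d\d\d\.)", SAMPLE)

# With the flag "^" matches at the start of every line.
with_m = re.findall(r"(^\d\d\d\.)", SAMPLE, flags=re.MULTILINE)

print(without_m)
print(with_m)
```

Without the flag nothing is found, because the text does not begin with three digits. With it, only the numbers that start a line are returned.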

When testing regular expressions in the Test Regular Expressions program for Biblioscape, by preference only the “m” modifier should be used until more detailed knowledge is gained.

Copy the regular expression just compiled from the Test Regular Expressions program and open the Roget’s Thesaurus filter made ready earlier. Open the main filter for editing, type RE at the beginning of the “Record” tab “First Tag” field, paste the expression and type RE at the end, so that the field reads RE(^\d\d\d\.)RE.
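The wrapping step can be written out as a tiny Python helper, purely to show the resulting text. The function name as_biblioscape_tag is hypothetical, and the check for brackets is deliberately simple: it only looks at the first and last characters.

```python
def as_biblioscape_tag(expression):
    """Add the RE declarations around an expression tested with its brackets."""
    if not (expression.startswith("(") and expression.endswith(")")):
        raise ValueError("test the expression with its start and end brackets first")
    return "RE" + expression + "RE"


first_tag = as_biblioscape_tag(r"(^\d\d\d\.)")
print(first_tag)
```

This is why keeping the start and end brackets while testing helps later: the tested expression can be pasted between the two RE declarations unchanged.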

OK back to the main filters list and edit the sub-filter. In the “Match Fields” tab make the following entry.


OK out of the import filter dialogues and test that filter. An error is reported. What can be wrong with the regular expression? The answer is nothing, but an issue identified earlier is coming into play. Recall that where only one tag field exists within an import filter, and that tag appears within the “Record” “First Tag” field, an error will be reported.

As this filter is intended to have a single sub-filter, open the sub-filter again and make the following entry in the “Match Fields” tab.


At this point also make an entry in the filter Replace or Remove tab

Looking at the page open in the browser, and using the Hex Editor, check the document to identify the characters which follow what are intended to be the contents of the Title field. Note that 2E 0D 0A immediately follows the required text on all occasions. These hexadecimal codes equate to a full stop, carriage return and new line, which cannot always be used as tags, but in a simple filter and used in the manner intended they can.

Open the complex fields tab and make the following entries in the order listed:-

The \x indicates a hexadecimal character. The three used are the full stop (2E), carriage return (0D) and new line (0A) characters. The last two also equate to the regular expression meta characters \r (0D) and \n (0A), which could be used instead, but in pursuit of consistency and clarity for new users only the hexadecimal will be utilised for the moment.
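The equivalence between the hexadecimal form and the \r and \n form can be checked in Python. This is a sketch assuming Python’s re module, which also accepts \x escapes; the sample text and the names HEX_PATTERN and ESCAPED_PATTERN are hypothetical.

```python
import re

# Full stop, carriage return, new line written two ways.
HEX_PATTERN = r"\x2E\x0D\x0A"
ESCAPED_PATTERN = r"\.\r\n"

# Hypothetical stand-in for the captured page.
SAMPLE = "528. Concealment.\r\nsee 530; 666.\r\n"

hex_hits = [m.start() for m in re.finditer(HEX_PATTERN, SAMPLE)]
escaped_hits = [m.start() for m in re.finditer(ESCAPED_PATTERN, SAMPLE)]

# Both patterns find the same positions; the full stop after "528" is not
# matched because it is followed by a space rather than 0D 0A.
print(hex_hits == escaped_hits, len(hex_hits))
```

Note that a full stop written as \x2E is treated as a literal character here, so it does not need the extra “\” that a plain “.” requires.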

Notice how those two complex fields become accessible from both Document and Title map tag fields to data fields.

Now to populate the reference and tidy up.

In the main filter dialogue replace or remove tab make the following entries:-

The replace function is used extensively in this filter to populate some fields.

Import the references once more and additionally in the Capture References dialogue Options tab enter the additional keyword “Privacy” before pressing the “Start” button.

This import filter, providing a simple example of regular expression use, is now complete. As with previous ones it is possible to tidy it up and further embellish it, but the main objective of utilising regular expressions within a simple filter has been attained.

Because many of the available Biblioscape styles do not currently accommodate URL links to e-books it will be necessary to modify one of the current styles to display the URL within the formatted preview window. It is worth the effort to do this short and simple task now as it will serve not only to give a brief view of the style window but also provide some insight into the importance of choosing database fields carefully.

Create a Sample Style using BaseOn.


From within the references module select Tools|Styles|Output Styles or Shift+CTRL+S to open the Output Styles dialogue window.

Scroll in the main style window to the Adv Human Genetics style, select it and then select the BaseOn button. When the Input Box dialogue to name the new journal opens, name the new style “Sampler” and OK.

Double click the Sampler style this creates. In the details dialogue which opens select the “Favorites” and “Cite in Note” tick boxes, then insert the date in the last update box and delete the entry in the Category field, replacing it with “Tutorial”. Finally select OK.

With the Sampler style selected, in the sub-styles window select “Book Section” and the edit button.

Select the “Reference List/Bibliography” tab and in the “Templates” fields list select the last entry “Static Text”. Now scroll down in the “Available Fields” list and select “URL” then the “Insert” button. This should add URL to the bottom of the “Templates” list.

Double click the “URL” entry in the “Templates” list and in the dialogue window which opens enter “ available from ” (There is a leading and trailing space) in the left hand input box to insert that text before the database field data, then select the OK button.

Double click the “Editors” entry in the “Templates” list and in the dialogue window which opens, in the “Author Name Format” tab select the entry “James Philip Smith” from the drop down list and then select the OK button.

For completeness now change the reference type in the drop down list located in the top left corner of the sub-styles window, to the “Generic” entry, and repeat the steps of adding the URL and tidying the Editors format. (Remember the crib sheet if unsure what the field name is displayed as.)

Having completed that do the same again for the “Journal” entry.

Now OK out of the sub-styles dialogue and the main styles dialogue windows.

To change to the new style, in the references module select the “Output Style” drop down list and the “Sampler” style which should be displayed near the top of the list.

The Formatted Preview of the reference visible in the preview pane should now include the URL, and the URL will be included in any formatted reference which uses that particular style.

e.g. 1911, Concealment, in: Roget’s Thesaurus (ARTFL. Project, ed.). available from http://machaut.uchicago.edu/?resource=Roget%27s

This brief foray into styles amply illustrates the importance of selecting the correct database fields when compiling an import filter. To populate the wrong fields at such an early stage would compromise the ability to simply utilise the different styles based upon a common database.