How to choose a different filter when doing a Direct Export

Hello,

I have Direct Export working in Biblioscape 8. In IE, when I click a link that generates a .ris file, Biblioscape opens it directly and the record is put in the Reference folder. I think this is how it is supposed to work.

The problem I am facing is that I regularly use different websites that generate .ris files in different formats. To import the data correctly, I need to select the correct filter when doing the import.

Currently, I have to disable Direct Export by manually (in Windows) associating the .ris file type with WordPad. Then, when I click a link that generates a .ris file, Windows prompts me to save the file to disk. I then use 'File | Import...' to open the saved .ris file, and in the 'Import References' window I get the chance to choose the correct filter. This works, but it is very inefficient.

My question is: can I use Direct Export, but have Biblioscape stop at the 'Import References' window so that I can choose a different filter?

Thank you very much !

That is possible. But not

That is possible, but it is not the best solution. Biblioscape import filters support regular expressions. I know the RIS format is loosely defined, and it can differ slightly from source to source. I think it is possible to create one import filter that fits all of them. If you post a few records that show the problem, I will try to modify the import filter to make it work for the different cases.

Hi, Paul, Thank you for your

Hi, Paul,
Thank you for your willingness to solve this problem. I will try to describe the problem clearly below, although it might be a little long :-O

The inconsistency of the ris format across different publishers in my field (Physics) shows up mainly in these two areas:
1. The endpage, and
2. the doi information.

===============================================================
Problem #1: how the endpage is indicated. The situation is:

type A: there is no endpage; the standard endpage tag 'EP - ' is used for the number of pages
type B: an unconventional tag 'LP - ' is used; the 'EP - ' tag does not appear
type C: the normal tag 'EP - ' is used

Example: (only the relevant part is shown)
type A:
SP - 212301
EP - 4

type B:
SP - 8434
LP - 8439

type C:
SP - 4835
EP - 4837

So the problem is, I can set up a rule in the general import filter like:
endpage = LP..-.^[+]^EP..-.
to catch types B and C, but then for type A the number of pages will incorrectly be put into the endpage field.

But there is hope. For type A, the SP field always has 6 digits, while types B and C have at most 5 digits in their SP field.
So the logic needed here is: (borrowing C++-like notation)

if ( length(SP) == 6 ) {   // captures type A
    misc = EP              // record the number of pages in some misc field
}
else {                     // types B and C
    endpage = LP..-.^[+]^EP..-.
}

I am not sure whether this kind of logic can be expressed with regular expressions.
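
For readers who want to see the branching above in runnable form, here is a minimal Python sketch. It runs outside Biblioscape and says nothing about whether the filter syntax can express the same logic. The function name `split_pages`, the dictionary layout, and the `misc` key are assumptions made for illustration only.

```python
def split_pages(record):
    """Decide where the EP/LP values of one RIS record should go.

    `record` is a hypothetical dict mapping RIS tags to their values.
    """
    sp = record.get("SP", "")
    result = {"startpage": sp, "endpage": "", "misc": ""}
    if sp.isdigit() and len(sp) == 6:
        # Type A: SP is a 6-digit article number, so EP is a page count.
        result["misc"] = record.get("EP", "")
    else:
        # Types B and C: the end page is in LP if present, otherwise in EP.
        result["endpage"] = record.get("LP") or record.get("EP", "")
    return result


print(split_pages({"SP": "212301", "EP": "4"}))   # type A
print(split_pages({"SP": "8434", "LP": "8439"}))  # type B
print(split_pages({"SP": "4835", "EP": "4837"}))  # type C
```

The sketch relies on the same observation as the pseudocode: a 6-digit SP marks a type A record. A journal that really has 6-digit page numbers would be misclassified.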

===================================================
Problem 2: how to get doi information

type A: no doi field; the doi can be extracted from the 'ER - ' tag
by stripping the prefix "http://dx.doi.org/"
ex:
ER - http://dx.doi.org/10.1103/PhysRevB.78.212301

type B: doi in tag 'ID - '
ex:
ID - 10.1103/PhysRevB.46.8434

type C: no doi info, but has an irrelevant 'ID - ' tag
ex:
ID - 0295-5075-86-3-37002

type D: doi in tags 'N1 - ' and 'M3 - '
ex:
N1 - 10.1038/nature07816
M3 - 10.1038/nature07816

type E: doi in tag 'N1 - ', but extra info is present in data
ex:
N1 - doi: 10.1021/nl034196n

type F: doi in tag 'M3 - ', with extra info
ex:
M3 - doi: DOI: 10.1016/S0379-6779(98)00278-1

In summary, the doi can be in the tags ER, ID, N1, and M3, or it can be absent. These tags may mean different things in different ris files (as in type C), the doi may be duplicated in two tags (type D), or the data field may be contaminated with extra text (types A, E, F).
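
As an illustration of the cases listed above, the following Python sketch looks through the four tags and returns the first value that looks like a doi. It is not a Biblioscape filter. The pattern `10\.\d{4,9}/\S+` is a heuristic assumption about what a doi looks like, and the name `find_doi` and the dictionary layout are hypothetical.

```python
import re

# Heuristic doi shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This pattern is an assumption, not part of the RIS format.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def find_doi(record):
    """Return the first doi-looking string found in ER, ID, N1 or M3, else None."""
    for tag in ("ER", "ID", "N1", "M3"):
        match = DOI_RE.search(record.get(tag, ""))
        if match:
            return match.group(0)
    return None


print(find_doi({"ER": "http://dx.doi.org/10.1103/PhysRevB.78.212301"}))  # type A
print(find_doi({"ID": "0295-5075-86-3-37002"}))                          # type C
print(find_doi({"M3": "doi: DOI: 10.1016/S0379-6779(98)00278-1"}))       # type F
```

Searching for the doi itself, rather than removing each known prefix, means the irrelevant ID value of type C is simply skipped.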

Here, RegEx might be able to help.

=====================================================

I am not sure whether there is a solution within the current Biblioscape setup. Even if there is, it might be too hard and not worth doing. Even without a solution, any thoughts on this are greatly appreciated.

P.S.
The reason I chose ris is that some sites only provide ris citation downloads (like nature.com), and this format seems to be more common than enw (EndNote) files.

Here is an alternative question: can Biblioscape Direct Export handle different file types, i.e., enw and ris? If so, I could write the generic ris filter just for the sites that only offer ris downloads. This might be easier.

Almost solved

I have been reading the Biblioscape user guide and getting more practice with regex. It turns out that the two problems I raised last week can both be solved with the help of regex and the "Replace or Remove" rules for a filter.

Here, I just want to record what has been done to solve the two problems. I also encountered another problem that so far I have not been able to solve; maybe someone can help?

1. The first problem was the endpage. Some journal articles have no endpage information, some use the tag LP to indicate the endpage, some use EP, and some others use EP to indicate how many pages the article has.
Since we know that when SP has the form of 6 digits, an EP tag, if present, must mean the number of pages, the solution is:
1) let endpage be the match of "SP - ^[+]^EP - ^[+]^LP - "
2) in "Replace or Remove", choose the endpage field, then match RE(\d{6} .*|\d{1,5} )RE for removal. This way, if SP has 6 digits, the whole endpage field matches and nothing is left for endpage; if it does not, only the SP part of the endpage field is matched, and the real endpage remains.

2. Similarly, one can do this for the doi. Different vendors put the doi information in tags like ID, N1, M3, and ER.

We can
1) capture all this information first, by matching the following to doi: "ER - ^[+]^ID - ^[+]^N1 - ^[+]^M3 - "
2) write "Replace or Remove" rules to get rid of contamination like "doi: " or "http://dx.doi.org/"
One can write several rules in sequence, so there is no need to devise one complicated regex rule that does everything in one place.
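
The capture-then-remove idea from both points can be tried out in Python. This is only a simulation under stated assumptions: it assumes the captured tag values end up joined by single spaces, and that a "Replace or Remove" rule deletes every match of its pattern. The helper names `remove_rule`, `endpage`, and `clean_doi` are hypothetical, and Python's `re` module is not the engine Biblioscape uses.

```python
import re


def remove_rule(pattern, text):
    """Imitate one "Replace or Remove" rule: delete every match of `pattern`."""
    return re.sub(pattern, "", text)


def endpage(sp, ep="", lp=""):
    # Assumption: the captured SP, EP and LP values are joined by single spaces.
    captured = " ".join(v for v in (sp, ep, lp) if v)
    # Same pattern as in the post: a 6-digit SP swallows everything,
    # a shorter SP is removed on its own and the real end page remains.
    return remove_rule(r"\d{6} .*|\d{1,5} ", captured)


def clean_doi(value):
    # Several simple rules applied one after another.
    for pattern in (r"http://dx\.doi\.org/", r"(?i)doi:\s*"):
        value = remove_rule(pattern, value)
    return value.strip()


print(repr(endpage("212301", ep="4")))   # type A
print(repr(endpage("8434", lp="8439")))  # type B
print(clean_doi("doi: DOI: 10.1016/S0379-6779(98)00278-1"))
```

Keeping each removal rule small makes it easy to check one contamination pattern at a time.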

Now both of these problems are solved, but I have another one, which I will post in another thread soon.

I am glad to see the power

I am glad to see the power of regular expressions being used. Once you have finished fine-tuning the import filter, I hope you can share it with other users. Thanks, Paul

setback

Hi, Paul,

I would like to know which regex engine is used in Biblioscape 8.

The regexes I put into Biblioscape are not working as they should. I have two examples.

First example.

(.*) (?=\1)
acting on the following line should capture the first two words with the trailing space
Physical Reviews Physical Reviews
i.e., it should capture "Physical Reviews "
But when I set up a rule in the "Replace or Remove" tab that searches for RE((.*) (?=\1))RE and replaces it with blank, nothing happens.

Second example.

.*(?=[12]\d{3})
acting on the following line should capture everything before the 4-digit year
Aug. 13, 1990
i.e., it should capture "Aug. 13, "
But a rule RE(.*(?=[12]\d{3}))RE only captures "Aug. "

I have tested both regexes using this online tool:
http://www.fileformat.info/tool/regex.htm
and both worked very well. But something seems to go wrong when these regexes are used in Biblioscape.

I should point out that both regexes use the "lookahead" feature/syntax of regex.
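
The expected results can also be reproduced offline with a short Python check. Python's `re` module supports both lookahead and backreferences, but it is a different engine from whatever Biblioscape uses, so this only confirms what a lookahead-capable engine returns for the two examples.

```python
import re

# First example: a group followed by a space, then a lookahead for the same text.
m1 = re.search(r"(.*) (?=\1)", "Physical Reviews Physical Reviews")

# Second example: everything before a 4-digit year starting with 1 or 2.
m2 = re.search(r".*(?=[12]\d{3})", "Aug. 13, 1990")

first = m1.group(0)
second = m2.group(0)
print(repr(first))   # 'Physical Reviews '
print(repr(second))  # 'Aug. 13, '
```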

Any hint?

I don't know which regular

I don't know which regular expression engine is used. It may not support the lookahead feature. I cannot get that information right now. I will find it later. Thanks, Paul

Regexpstudio.com

Sorry for the late posting; I have been busy with other things.

According to the v.6 help file, Regexpstudio.com is the regex flavour that was used in earlier versions of Biblioscape.

Has that been changed, Paul?

For any users who want to learn more, I have personally found Mastering Regular Expressions by Jeffrey Friedl to give reasonably comprehensive coverage of regular expressions.

Ian

Ian, you are right about the

Ian, you are right about the regular expression engine. That escaped me last time.

According to this wiki

According to this wiki page,
http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
if the "lookahead" feature is not implemented, Biblioscape probably uses the TRE regular expression library http://laurikari.net/tre/

I would also like to know

I would also like to know more about the engine and the supported features. In fact, could you please consider adding this information (and information on filter editing in general; I am thinking of MARC 21 XML in particular, as mentioned elsewhere) to the documentation? Thanks.