Coding: performance in large databases
Posted: Thu Apr 10, 2014 8:04 am
by fboehlandt
Hi
A quick question regarding performance: I have a workfile with several thousand time series and have written some code that loops through them and calculates their cross-correlations. Is there a performance impact from very large databases? Is there a maximum number of series, per the developer's reference, that should not be exceeded? The code seems to execute quite slowly.
Thanks
Re: Coding: performance in large databases
Posted: Thu Apr 10, 2014 8:37 am
by EViews Gareth
When you use the word "database", do you actually have a database, or are you just using EViews workfiles?
How have you written the loop?
Re: Coding: performance in large databases
Posted: Fri Apr 11, 2014 3:48 am
by fboehlandt
Hi Gareth
I have a SQL database to store the series. I import to EViews from an Excel extract.
To clarify my problem:
Let us say I have 1,000 time series, each with different starting and end dates.
I need to compare all series with one another to determine their degree of correlation. If r >= 0.99, I would like to record both series names in a table (incl. r):
<%series1> <%series2> <r>
I am using an inner and an outer loop to calculate the roughly (1,000 * 1,000) / 2 = 500,000 correlations.
The loop is something like this (there are a few more intricacies, but they are not relevant in this context):
Code: Select all
!n = {%all}.@count
for !i = 1 to !n
    %n = {%all}.@seriesname(!i)
    for !j = 1 to !n
        %m = {%all}.@seriesname(!j)   'was %n, but @cor below needs %m
        !r = @cor({%m}, {%n})
        'record somewhere...
    next
next
The code executes just fine but takes an extremely long time to run (which is not entirely unexpected, considering the number of correlations).
And here are my two questions:
- What is the most performance-efficient way to do this?
- Are there other things I could be doing, like reducing the number of series in the workfile?
Re: Coding: performance in large databases
Posted: Fri Apr 11, 2014 7:43 am
by EViews Gareth
Can't you just use group.cor(out=name) to calculate all the correlations in one go?
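For reference, a minimal sketch of that suggestion (the names g01 and cmat are placeholders; this assumes the cor view's pairwise and out= options, which use pairwise samples and save the results to a sym matrix):
Code: Select all
' compute all pairwise Pearson correlations in one call and
' save them to a symmetric matrix instead of looping
g01.cor(pairwise, out=cmat) corr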
Re: Coding: performance in large databases
Posted: Tue May 20, 2014 7:05 am
by fboehlandt
Hi Gareth
that will work; I forgot that you can uncheck 'common samples'. I can see the calculations being performed. However, something strange happens once it finishes: I get an 'out of memory' error. Perhaps it would be better to execute this from code in quiet mode?
Re: Coding: performance in large databases
Posted: Tue May 20, 2014 7:17 am
by fboehlandt
p.s. Specifically, I am looking for the syntax for Pearson correlation with unmatched samples, something along the lines of:
mygroup.corr(...
Re: Coding: performance in large databases
Posted: Tue May 20, 2014 8:51 am
by fboehlandt
Nope, it still does it. I have:
freeze(corr_table) group01.cor(pairwise) corr
It runs fine until it gets to creating the table, at which point it exits with 'out of memory'. Any suggestions?
Re: Coding: performance in large databases
Posted: Tue May 20, 2014 11:03 am
by EViews Glenn
How many observations do you have in your workfile? And do you need the table or do you just want to save the results into a symmetric matrix?
Re: Coding: performance in large databases
Posted: Wed May 21, 2014 12:45 am
by fboehlandt
About 7000. I need to get the data into Excel. Is there another way to store and export?
Re: Coding: performance in large databases
Posted: Wed May 21, 2014 7:34 am
by EViews Glenn
In a SYM matrix. Let me run some tests; I still think your problem might be too big.
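A sketch of that route (the group, matrix, and path names are placeholders, and the write options are my assumption, not confirmed in this thread):
Code: Select all
' save the correlations straight to a sym matrix, skipping the
' frozen table that triggers the out-of-memory error
group01.cor(pairwise, out=cmat) corr
' export for use in Excel; plain text is used here because a
' 7000 x 7000 matrix would not fit the old .xls column limit
cmat.write(t=txt) "c:\temp\cmat.txt"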
Re: Coding: performance in large databases
Posted: Wed May 21, 2014 9:19 am
by fboehlandt
Hi Glenn
can't I generate quadrants of the corr matrix, i.e.:
(1-5000) x (1-5000)
(5001-7000) x (1-5000)
(1-5000) x (5001-7000)
(5001-7000) x (5001-7000)
I'm not sure whether covariance analysis makes provisions for that?
Re: Coding: performance in large databases
Posted: Wed May 21, 2014 9:33 am
by EViews Glenn
You could, but computing the off-diagonal blocks isn't built in. If you didn't have unbalanced data, I'd say you could just roll your own using matrix operations, but with unbalanced data, your loops are the only way I can think of computing this at the moment.
To double check...you have 5000 series with 7000 observations each?
Re: Coding: performance in large databases
Posted: Thu May 22, 2014 2:22 am
by fboehlandt
No, I have 7000 series with varying numbers of observations (i.e. unbalanced). The maximum number of observations would be n = 360. I have no issues using a loop to calculate the correlation factors, but the code executes very slowly:
Code: Select all
'set group
%group01 = "group01"
!n = {%group01}.@count   'was {%group01) with a mismatched bracket
!cnt = 1
'loop over the upper triangle only
for !i = 1 to !n
    %n = {%group01}.@seriesname(!i)
    table01(!i + 1, 1) = %n   'row label, offset to line up with the values
    !cnt = !cnt + 1
    for !j = !cnt to !n
        %m = {%group01}.@seriesname(!j)
        table01(1, !j + 1) = %m   'column label, same offset
        !r = @cor({%m}, {%n})
        table01(!i + 1, !j + 1) = !r
    next
next
I set it up so that it ignores the diagonal elements and calculates the top half of the matrix only (to save time). This executes considerably more slowly than calculating the correlations using covariance analysis on groups. In my opinion, there could be two reasons for that:
- covariance analysis uses a sophisticated routine to estimate correlations (in which case there is nothing I can do)
- writing the results to a table one at a time slows down execution
If the latter is the case, could this be improved by storing the results in a matrix and writing them to a table once execution is finished? If so, could you perhaps assist with the syntax for matrix declaration and element assignment? I would then proceed with calculating the 'quadrants' of the matrix as suggested in my previous post. This is a one-off exercise, so the method does not have to be pretty.
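For what it's worth, matrix declaration and element assignment look roughly like this (a sketch only; rmat and the export path are made-up names, and this simply restates the loop above with a matrix as the store):
Code: Select all
'declare an n x n matrix (initialised to zeros)
%group01 = "group01"
!n = {%group01}.@count
matrix(!n, !n) rmat
for !i = 1 to !n
    %n = {%group01}.@seriesname(!i)
    for !j = !i + 1 to !n
        %m = {%group01}.@seriesname(!j)
        rmat(!i, !j) = @cor({%m}, {%n})   'element assignment
    next
next
'one bulk export at the end instead of per-element table writes
rmat.write(t=txt) "c:\temp\rmat.txt"   'hypothetical path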
Background: from the workfile of 7000 series I need to identify pairs of series with a correlation of r >= .99. This is one of several sub-processes to identify 'double-reporting' investment funds in a large database containing time series of periodic returns. The series are such that they can have different start and end dates. Some series may not even overlap.
Re: Coding: performance in large databases
Posted: Thu May 22, 2014 7:08 am
by fboehlandt
Okay, I tested it with a matrix and it does not make much of a difference. This is what I have now:
Code: Select all
'set groups
%group01 = "group01"
%group02 = "group01"   'second group (here the same as the first)
!n = {%group01}.@count
!m = {%group02}.@count
!cnt = 1
'loop over all pairs across the two groups
for !i = 1 to !n
    %n = {%group01}.@seriesname(!i)
    dummy.add {%n}
    table01(1 + !i, 1) = %n
    table01(1, 1 + !i) = %n
    !cnt = !cnt + 1
    for !j = !cnt to !m
        %m = {%group02}.@seriesname(!j)
        dummy.add {%m}
        !r = @cor({%m}, {%n})
        if !r <> NA then
            'restrict the sample to observations where every
            'series in the dummy group is non-missing
            smpl if @robs(dummy) = dummy.@count
            'discard correlations based on 12 or fewer joint obs
            if @obssmpl <= 12 then
                !r = NA
            endif
            smpl @all
        endif
        dummy.drop {%m}   'moved outside the if block: previously
                          '%m was never dropped when !r was NA
        table01(!i + 1, !j + 1) = !r
    next
    dummy.drop {%n}
next
I adjusted the code to account for cross-correlations as well (i.e. two groups of series). In addition, I built in a check to count the number of joint observations for two series, the reason being that some series might show a very high correlation but only share three observations. That is precisely the problem: having to group the two series now slows execution to a snail's pace. Any input is greatly appreciated.
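One possible simplification (a guess, not something tested in this thread): count the joint observations with a scratch series instead of adding and dropping group members and resetting the sample on every pass. The names tmp and !njoint are placeholders:
Code: Select all
'tmp is NA wherever either series is NA, so @obs(tmp) counts
'the observations the two series share
series tmp = {%m} + {%n}
!njoint = @obs(tmp)
if !njoint <= 12 then
    !r = NA
endif
delete tmp
Whether this is actually faster would need testing, but it avoids the repeated smpl statements.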
Re: Coding: performance in large databases
Posted: Thu May 22, 2014 8:43 am
by EViews Glenn
If all you want are the indices of the variables with high correlations (possibly along with those correlations), I think that can be done with a bit of work. I'm away from my computer; let me look at it.
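(The kind of scan Glenn seems to have in mind might look like the sketch below; it assumes the correlations were already saved to a sym matrix cmat via out= as above, and hits is a made-up table name.)
Code: Select all
'walk the upper triangle of the saved correlation matrix and
'record every pair with r >= 0.99, plus the correlation itself
!n = group01.@count
table hits
!row = 1
for !i = 1 to !n - 1
    for !j = !i + 1 to !n
        !r = cmat(!i, !j)
        if !r <> NA and !r >= 0.99 then
            hits(!row, 1) = group01.@seriesname(!i)
            hits(!row, 2) = group01.@seriesname(!j)
            hits(!row, 3) = !r
            !row = !row + 1
        endif
    next
next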