Coding: performance in large databases
Moderators: EViews Gareth, EViews Moderator, EViews Jason, EViews Matt
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Coding: performance in large databases
Hi
quick question regarding performance: I have a workfile with several thousand time series, and I have written some code that loops through them and calculates their cross-correlations. Is there a performance impact from very large databases? Is there a maximum number of series, per the developer's reference, that should not be exceeded? The code seems to execute quite slowly.
Thanx
-
EViews Gareth
- Fe ddaethom, fe welon, fe amcangyfrifon
- Posts: 13585
- Joined: Tue Sep 16, 2008 5:38 pm
Re: Coding: performance in large databases
When you use the word "database", do you actually have a database, or are you just using EViews workfiles?
How have you written the loop?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Gareth
I have a SQL database to store the series. I import to EViews from an Excel extract.
To clarify my problem:
Let us say I have 1,000 time series, each with different starting and end dates.
I need to compare all series with one another to determine their degree of correlation. If r >= 0.99, I would like to record both series names (and r) in a table:
<%series1> <%series2> <r>
I am using an inner and an outer loop to calculate the (1,000 * 1,000) / 2 = 500,000 correlation coefficients.
The loop is something like this (there are a few more intricacies but they are not relevant in this context):
The code executes just fine but takes extremely long to run (which is not entirely unexpected considering the number of comparisons).
Code:
!n = {%all}.@count
for !i = 1 to !n
  %m = {%all}.@seriesname(!i)
  for !j = 1 to !n
    %n = {%all}.@seriesname(!j)
    !r = @cor({%m}, {%n})
    'record somewhere...
  next
next
And here my two questions:
- What is the most performance-efficient way to do this?
- Are there other things I could be doing, like reducing the number of series in the workfile?
-
EViews Gareth
- Fe ddaethom, fe welon, fe amcangyfrifon
- Posts: 13585
- Joined: Tue Sep 16, 2008 5:38 pm
Re: Coding: performance in large databases
Can't you just use group.cor(out=name) to calculate all the correlations in one go?
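As an aside for readers reproducing this outside EViews: the "all correlations in one go" idea has a direct pandas equivalent, where `DataFrame.corr()` uses pairwise-complete observations by default, so differing start and end dates are handled automatically (an illustrative sketch, not the poster's actual EViews setup):

```python
import numpy as np
import pandas as pd

# Toy data: three series with different start/end dates (NaN = missing).
df = pd.DataFrame({
    "s1": [1.0, 2.0, 3.0, 4.0, np.nan],
    "s2": [np.nan, 2.1, 2.9, 4.2, 5.0],
    "s3": [1.0, np.nan, 3.1, 3.9, 5.1],
})

# One call computes the full correlation matrix; each pair uses only
# the observations where both series are present (pairwise-complete).
cmat = df.corr(method="pearson")
print(cmat.round(3))
```

The point is the same one Gareth makes: a single built-in call over a group of series is far cheaper than an explicit double loop of scalar correlation calls.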
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Gareth
that will work; I forgot that you can uncheck 'common samples'. I can see the calculations being performed. However, something strange happens once it finishes: I get an 'out of memory' error. Perhaps it is better to execute this from code in quiet mode?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
p.s. Specifically, I am looking for the syntax for Pearson correlation with unmatched samples, something along the lines of:
mygroup.corr(...
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Nope, still does it. I have:
freeze(corr_table) group01.cor(pairwise) corr
Runs fine until it gets to creating the table when it exits with 'out of memory'. Any suggestions?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
How many observations do you have in your workfile? And do you need the table or do you just want to save the results into a symmetric matrix?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
About 7000. I need to get the data into Excel. Is there another way to store and export?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
In a SYM matrix. Let me run some tests. I still think you might be too big.
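As an aside on the export step: outside EViews, a labelled correlation matrix can be written to a CSV file that Excel opens directly, which avoids building a large table object at all (illustrative sketch; the series names are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real workfile: four return series.
rng = np.random.default_rng(2)
data = pd.DataFrame(rng.standard_normal((100, 4)),
                    columns=["f1", "f2", "f3", "f4"])
cmat = data.corr()

# CSV keeps series names as row/column labels and opens directly in Excel.
cmat.to_csv("correlations.csv")
```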
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Glenn
can't I generate quadrants of the correlation matrix, i.e.:
(1-5000) x (1-5000)
(5001-7000) x (1-5000)
(1-5000) x (5001-7000)
(5001-7000) x (5001-7000)
Not sure if covariance analysis makes provisions for that?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
You could, but computing the off-diagonal blocks isn't built in. If you didn't have unbalanced samples, I'd say you could just roll your own using matrix operations, but with unbalanced data your loops are the only way I can think of computing it at the moment.
To double check...you have 5000 series with 7000 observations each?
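Glenn's "roll your own using matrix operations" point for the balanced case can be sketched outside EViews: standardize each column, and a single matrix product yields an entire off-diagonal block of the correlation matrix (a NumPy illustration, assuming no missing values):

```python
import numpy as np

def corr_block(x, y):
    """Correlation block between columns of x (T x n) and y (T x m),
    assuming a balanced panel (no missing values)."""
    t = x.shape[0]
    # Standardize columns: subtract the mean, divide by the (population) std.
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    ys = (y - y.mean(axis=0)) / y.std(axis=0)
    # One matrix product gives all n * m correlations at once.
    return xs.T @ ys / t

rng = np.random.default_rng(0)
x = rng.standard_normal((360, 5))   # e.g. series 1-5
y = rng.standard_normal((360, 3))   # e.g. series 6-8
block = corr_block(x, y)            # 5 x 3 block of the full matrix
```

This is why the balanced case is fast: a 5,000 x 2,000 quadrant becomes one matrix multiplication instead of ten million scalar correlation calls.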
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
No, I have 7000 series with varying numbers of observations (i.e. unbalanced). The maximum number of observations would be n = 360. I have no issues using a loop to calculate the correlation coefficients, but the code executes very slowly:
I set it up so that it ignores the diagonal elements and calculates only the top half of the matrix (to save time). This executes considerably more slowly than calculating the correlations via covariance analysis of groups. In my opinion, there could be two reasons for that:
- covariance analysis uses a more efficient routine to estimate correlations (in which case there is nothing I can do)
- writing the results to a table one at a time slows down execution
If the latter is the case, could this be improved by storing the results in a matrix and writing them to a table once execution is finished? If so, could you perhaps assist with the syntax for matrix declaration and element assignment? I would then proceed with calculating the 'quadrants' of the matrix as suggested in my previous post. This is a once-off exercise, so the method does not have to be pretty.
Background: from the workfile of 7000 series I need to identify pairs of series with a correlation of r >= .99. This is one of several sub-processes to identify 'double-reporting' investment funds in a large database containing time series of periodic returns. The series are such that they can have different start and end dates. Some series may not even overlap.
Code:
'set group
%group01 = "group01"
!n = {%group01}.@count
'table01 is assumed to be an existing table object
!cnt = 1
'loop over the upper triangle only
for !i = 1 to !n
  %n = {%group01}.@seriesname(!i)
  table01(!i + 1, 1) = %n
  !cnt = !cnt + 1
  for !j = !cnt to !n
    %m = {%group01}.@seriesname(!j)
    table01(1, !j + 1) = %m
    !r = @cor({%m}, {%n})
    table01(!i + 1, !j + 1) = !r
  next
next
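For comparison outside EViews, the screening described here (keep only pairs with r >= 0.99, skipping pairs with too little overlap) maps onto pandas' `min_periods` option; the minimum of 13 joint observations below is an arbitrary illustrative choice, and `high_corr_pairs` is a hypothetical helper, not part of the poster's code:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, r_min=0.99, min_obs=13):
    """Return (series1, series2, r) for upper-triangle pairs with
    r >= r_min, requiring at least min_obs overlapping observations."""
    cmat = df.corr(method="pearson", min_periods=min_obs)
    cols = cmat.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = cmat.iloc[i, j]
            # r is NaN when the pair shares fewer than min_obs observations.
            if pd.notna(r) and r >= r_min:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Toy check: s2 is a scaled copy of s1, s3 is unrelated noise.
toy = pd.DataFrame({
    "s1": np.arange(20.0),
    "s2": np.arange(20.0) * 1.01,
    "s3": np.random.default_rng(1).standard_normal(20),
})
print(high_corr_pairs(toy))
```

Pairs that never overlap, or overlap too briefly, simply come back as NaN in the correlation matrix and are filtered out, which is exactly the behaviour the double-reporting screen needs.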
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Okay, I tested with a matrix and it does not make much of a difference. This is what I have now:
I adjusted the code to account for cross-correlations as well (i.e. two groups of series). In addition, I built in a check to count the number of joint observations for each pair of series, the reason being that some series might show very high correlation but share only three observations. That is precisely the problem: having to group the two series now slows execution to a snail's pace. Any input is greatly appreciated
Code:
'set groups; table01 and the (empty) group dummy are assumed to exist
%group01 = "group01"
'point this at e.g. "group02" when cross-correlating two different groups
%group02 = "group01"
!n = {%group01}.@count
!m = {%group02}.@count
!cnt = 1
'loop over the upper triangle
for !i = 1 to !n
  %n = {%group01}.@seriesname(!i)
  dummy.add {%n}
  table01(1 + !i, 1) = %n
  table01(1, 1 + !i) = %n
  !cnt = !cnt + 1
  for !j = !cnt to !m
    %m = {%group02}.@seriesname(!j)
    dummy.add {%m}
    !r = @cor({%m}, {%n})
    if !r <> NA then
      'keep only the observations where both series are present
      smpl if @robs(dummy) = dummy.@count
      'require more than 12 joint observations
      if @obssmpl <= 12 then
        !r = NA
      endif
      smpl @all
    endif
    'drop {%m} unconditionally, or the joint-observation count is wrong for later pairs
    dummy.drop {%m}
    table01(!i + 1, !j + 1) = !r
  next
  dummy.drop {%n}
next
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
If all you want are the indices of the variables with high correlations (possibly with those correlations), I think that can be done with a bit of work. I'm away from my computer. Let me look at it.