Coding: performance in large databases
Moderators: EViews Gareth, EViews Moderator, EViews Jason, EViews Matt
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Coding: performance in large databases
Hi
quick question regarding performance: I have a workfile with several thousand time series, and I have written some code that loops through them and calculates their cross-correlations. Is there a performance impact from very large databases? Is there a maximum number of series, per the developer's reference, that should not be exceeded? The code seems to execute quite slowly.
Thanx
-
EViews Gareth
- Fe ddaethom, fe welon, fe amcangyfrifon
- Posts: 13585
- Joined: Tue Sep 16, 2008 5:38 pm
Re: Coding: performance in large databases
When you use the word "database", do you actually have a database, or are you just using EViews workfiles?
How have you written the loop?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Gareth
I have a SQL database to store the series. I import to EViews from an Excel extract.
To clarify my problem:
Let us say I have 1,000 time series, each with different starting and end dates.
I need to compare all series with one another to determine their degree of correlation. If r >= 0.99, I would like to record both series names (and r) in a table:
<%series1> <%series2> <r>
I am using an inner and an outer loop to calculate the (1,000 * 1,000) / 2 = 500,000 correlation coefficients.
The loop is something like this (there are a few more intricacies but they are not relevant in this context):
The code executes just fine but takes extremely long to run (which is not entirely unexpected considering the number of comparisons).
Code:
!n = {%all}.@count
for !i = 1 to !n
  %m = {%all}.@seriesname(!i)
  for !j = 1 to !n
    %n = {%all}.@seriesname(!j)
    !r = @cor({%m}, {%n})
    'record somewhere...
  next
next
And here my two questions:
- What is the most performance-efficient way to do this?
- Are there other things I could be doing, like reducing the number of series in the workfile?
-
EViews Gareth
- Fe ddaethom, fe welon, fe amcangyfrifon
- Posts: 13585
- Joined: Tue Sep 16, 2008 5:38 pm
Re: Coding: performance in large databases
Can't you just use group.cor(out=name) to calculate all the correlations in one go?
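As an aside for readers reproducing this outside EViews: the "all correlations in one go" idea has a direct pandas equivalent, where `DataFrame.corr()` uses pairwise-complete observations by default, so differing start and end dates are handled automatically (an illustrative sketch, not the poster's actual EViews setup):

```python
import numpy as np
import pandas as pd

# Toy data: three series with different start/end dates (NaN = missing).
df = pd.DataFrame({
    "s1": [1.0, 2.0, 3.0, 4.0, np.nan],
    "s2": [np.nan, 2.1, 2.9, 4.2, 5.0],
    "s3": [1.0, np.nan, 3.1, 3.9, 5.1],
})

# One call computes the full correlation matrix; each pair uses only
# the observations where both series are present (pairwise-complete).
cmat = df.corr(method="pearson")
print(cmat.round(3))
```

The point is the same one Gareth makes: a single built-in call over a group of series is far cheaper than an explicit double loop of scalar correlation calls.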
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Gareth
that will work; I forgot that you can uncheck 'common samples'. I can see the calculations being performed. However, something strange happens once it finishes: I get an 'out of memory' error. Perhaps it is better to execute this from code in quiet mode?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
p.s. Specifically, I am looking for the syntax for Pearson correlation with unmatched samples, something along the lines of:
mygroup.corr(...
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Nope, still does it. I have:
freeze(corr_table) group01.cor(pairwise) corr
Runs fine until it gets to creating the table when it exits with 'out of memory'. Any suggestions?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
How many observations do you have in your workfile? And do you need the table or do you just want to save the results into a symmetric matrix?
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
About 7000. I need to get the data into Excel. Is there another way to store and export?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
In a SYM matrix. Let me run some tests. I still think you might be too big.
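As an aside on the export step: outside EViews, a labelled correlation matrix can be written to a CSV file that Excel opens directly, which avoids building a large table object at all (illustrative sketch; the series names are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real workfile: four return series.
rng = np.random.default_rng(2)
data = pd.DataFrame(rng.standard_normal((100, 4)),
                    columns=["f1", "f2", "f3", "f4"])
cmat = data.corr()

# CSV keeps series names as row/column labels and opens directly in Excel.
cmat.to_csv("correlations.csv")
```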
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Hi Glenn
can't I generate quadrants of the correlation matrix, i.e.:
(1-5000) x (1-5000)
(5001-7000) x (1-5000)
(1-5000) x (5001-7000)
(5001-7000) x (5001-7000)
Not sure if covariance analysis makes provisions for that?
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
You could, but computing the off-diagonal blocks isn't built in. If you didn't have unbalanced samples, I'd say you could just roll your own using matrix operations, but with unbalanced data your loops are the only way I can think of computing it at the moment.
To double check...you have 5000 series with 7000 observations each?
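Glenn's "roll your own using matrix operations" point for the balanced case can be sketched outside EViews: standardize each column, and a single matrix product yields an entire off-diagonal block of the correlation matrix (a NumPy illustration, assuming no missing values):

```python
import numpy as np

def corr_block(x, y):
    """Correlation block between columns of x (T x n) and y (T x m),
    assuming a balanced panel (no missing values)."""
    t = x.shape[0]
    # Standardize columns: subtract the mean, divide by the (population) std.
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    ys = (y - y.mean(axis=0)) / y.std(axis=0)
    # One matrix product gives all n * m correlations at once.
    return xs.T @ ys / t

rng = np.random.default_rng(0)
x = rng.standard_normal((360, 5))   # e.g. series 1-5
y = rng.standard_normal((360, 3))   # e.g. series 6-8
block = corr_block(x, y)            # 5 x 3 block of the full matrix
```

This is why the balanced case is fast: a 5,000 x 2,000 quadrant becomes one matrix multiplication instead of ten million scalar correlation calls.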
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
No, I have 7000 series with varying numbers of observations (i.e. unbalanced). The maximum number of observations would be n = 360. I have no issues using a loop to calculate the correlation coefficients, but the code executes very slowly:
I set it up so that it ignores the diagonal elements and calculates only the top half of the matrix (to save time). This executes considerably more slowly than calculating the correlations via covariance analysis of groups. In my opinion, there could be two reasons for that:
- covariance analysis uses a more efficient routine to estimate correlations (in which case there is nothing I can do)
- writing the results to a table one at a time slows down execution
If the latter is the case, could this be improved by storing the results in a matrix and writing them to a table once execution is finished? If so, could you perhaps assist with the syntax for matrix declaration and element assignment? I would then proceed with calculating the 'quadrants' of the matrix as suggested in my previous post. This is a once-off exercise, so the method does not have to be pretty.
Background: from the workfile of 7000 series I need to identify pairs of series with a correlation of r >= .99. This is one of several sub-processes to identify 'double-reporting' investment funds in a large database containing time series of periodic returns. The series are such that they can have different start and end dates. Some series may not even overlap.
Code:
'set group
%group01 = "group01"
!n = {%group01}.@count
'table01 is assumed to be an existing table object
!cnt = 1
'loop over the upper triangle only
for !i = 1 to !n
  %n = {%group01}.@seriesname(!i)
  table01(!i + 1, 1) = %n
  !cnt = !cnt + 1
  for !j = !cnt to !n
    %m = {%group01}.@seriesname(!j)
    table01(1, !j + 1) = %m
    !r = @cor({%m}, {%n})
    table01(!i + 1, !j + 1) = !r
  next
next
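For comparison outside EViews, the screening described here (keep only pairs with r >= 0.99, skipping pairs with too little overlap) maps onto pandas' `min_periods` option; the minimum of 13 joint observations below is an arbitrary illustrative choice, and `high_corr_pairs` is a hypothetical helper, not part of the poster's code:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, r_min=0.99, min_obs=13):
    """Return (series1, series2, r) for upper-triangle pairs with
    r >= r_min, requiring at least min_obs overlapping observations."""
    cmat = df.corr(method="pearson", min_periods=min_obs)
    cols = cmat.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = cmat.iloc[i, j]
            # r is NaN when the pair shares fewer than min_obs observations.
            if pd.notna(r) and r >= r_min:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Toy check: s2 is a scaled copy of s1, s3 is unrelated noise.
toy = pd.DataFrame({
    "s1": np.arange(20.0),
    "s2": np.arange(20.0) * 1.01,
    "s3": np.random.default_rng(1).standard_normal(20),
})
print(high_corr_pairs(toy))
```

Pairs that never overlap, or overlap too briefly, simply come back as NaN in the correlation matrix and are filtered out, which is exactly the behaviour the double-reporting screen needs.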
-
fboehlandt
- Posts: 83
- Joined: Thu Apr 15, 2010 3:54 am
Re: Coding: performance in large databases
Okay, I tested with a matrix and it does not make much of a difference. This is what I have now:
I adjusted the code to account for cross-correlations as well (i.e. two groups of series). In addition, I built in a check to count the number of joint observations for each pair of series, the reason being that some series might show very high correlation but share only three observations. That is precisely the problem: having to group the two series now slows execution to a snail's pace. Any input is greatly appreciated
Code:
'set groups; table01 and the (empty) group dummy are assumed to exist
%group01 = "group01"
'point this at e.g. "group02" when cross-correlating two different groups
%group02 = "group01"
!n = {%group01}.@count
!m = {%group02}.@count
!cnt = 1
'loop over the upper triangle
for !i = 1 to !n
  %n = {%group01}.@seriesname(!i)
  dummy.add {%n}
  table01(1 + !i, 1) = %n
  table01(1, 1 + !i) = %n
  !cnt = !cnt + 1
  for !j = !cnt to !m
    %m = {%group02}.@seriesname(!j)
    dummy.add {%m}
    !r = @cor({%m}, {%n})
    if !r <> NA then
      'keep only the observations where both series are present
      smpl if @robs(dummy) = dummy.@count
      'require more than 12 joint observations
      if @obssmpl <= 12 then
        !r = NA
      endif
      smpl @all
    endif
    'drop {%m} unconditionally, or the joint-observation count is wrong for later pairs
    dummy.drop {%m}
    table01(!i + 1, !j + 1) = !r
  next
  dummy.drop {%n}
next
-
EViews Glenn
- EViews Developer
- Posts: 2682
- Joined: Wed Oct 15, 2008 9:17 am
Re: Coding: performance in large databases
If all you want are the indices of the variables with high correlations (possibly with those correlations), I think that can be done with a bit of work. I'm away from my computer. Let me look at it.