RSS for Researchers

I would like to introduce RSS feeds and RSS readers, which I believe are very useful research tools in the internet era (RSS entry @ Wikipedia).

The problem first. I personally subscribe to only a few journals. To browse the latest articles published in the many other journals related to my research areas, I need to check the corresponding websites regularly. To check the latest issues of ten different journals, I need to visit ten different webpages. Alternatively, I can subscribe to the email notification services that many journals have offered for a long time. However, each email is a single object with several entries (articles) inside, so it is not easy to organize the entries. How about combining the lists of articles into one single list?

Put simply, each RSS feed is a list of entries provided by a server, e.g., news stories provided by a newspaper website, or article abstracts provided by a journal publisher. An RSS reader "pulls" the data from the RSS feed providers and shows it to the user. You can think of an RSS reader as your personalized magazine, with data from various magazines, journals, newspapers, websites, and other content providers, all collected and displayed in one place.
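To make the "pull and combine" idea concrete, here is a minimal sketch in Python of what a reader does behind the scenes. It is only an illustration: the two feeds are tiny made-up RSS 2.0 documents embedded as strings (the journal names and example.org links are hypothetical), and the function names read_feed and merge_feeds are my own, not part of any reader.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny hypothetical RSS 2.0 documents standing in for real journal feeds.
FEED_A = """<rss version="2.0"><channel>
  <title>Journal A</title>
  <item><title>Article A1</title><link>http://example.org/a1</link>
        <pubDate>Mon, 02 Jun 2008 09:00:00 GMT</pubDate></item>
  <item><title>Article A2</title><link>http://example.org/a2</link>
        <pubDate>Wed, 04 Jun 2008 09:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
  <title>Journal B</title>
  <item><title>Article B1</title><link>http://example.org/b1</link>
        <pubDate>Tue, 03 Jun 2008 09:00:00 GMT</pubDate></item>
</channel></rss>"""


def read_feed(xml_text):
    """Turn one RSS document into a list of entry dictionaries."""
    channel = ET.fromstring(xml_text).find("channel")
    source = channel.findtext("title")
    return [
        {
            "source": source,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        }
        for item in channel.findall("item")
    ]


def merge_feeds(feeds):
    """Combine the entries of several feeds into one list, newest first."""
    entries = [entry for xml_text in feeds for entry in read_feed(xml_text)]
    return sorted(entries, key=lambda entry: entry["published"], reverse=True)


merged = merge_feeds([FEED_A, FEED_B])
for entry in merged:
    print(entry["published"].date(), entry["source"], entry["title"])
```

A real reader would also download the feeds, remember which entries have been read, and cope with other feed formats and date styles, none of which this sketch attempts.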

With an RSS reader, I only need to read my reader daily, or weekly, or on whatever schedule I like. The latest issues will appear when they are available. There is no need to keep a long list of websites for all the journals I want to scan, and no need to remember or check the release schedule of each journal. Moreover, RSS readers usually allow users to tag (label) and save an entry. Therefore, I can keep and file articles that I am interested in for later access. This is difficult to do with email notifications.

As you may notice in the diagram I prepared, an RSS reader is not only for obtaining the latest tables of contents from journals. Nowadays, many content providers have RSS feeds. Many societies have RSS feeds for their latest news, many news websites have feeds for their news stories, many discussion groups have feeds for the latest posts, and a feed is nearly a standard service for a blog. Therefore, an RSS reader is actually a one-stop personalized news service that combines all the news you want to keep track of.

A quick online search will return many online how-to guides on using RSS feeds and RSS readers (also known as RSS aggregators). I think the best way to learn how to use RSS feeds is to learn how to use an RSS reader. I myself have tried various RSS readers, and found that two of them suit my needs. One is the older version of FeedDemon (not 3.0), which needs to be downloaded and installed on a computer. The other one is Google Reader, which is an online RSS reader that I can access from any computer with internet access. They may not be the best, so you need to try them yourself to see which one suits your own needs. Google has an interesting video that illustrates how to use Google Reader, which I think also illustrates the idea of RSS feeds in general.

To illustrate how I use an RSS reader for research, these are some sites with feeds I need:

Next time you visit a website, look for the RSS or XML icons. There are fewer and fewer websites that do not provide an RSS feed. :)

Article: "Potential Problems in the Statistical Control of Variables ..." by Becker 2005

Article: Becker, T. E. (2005). Potential problems in the statistical control of variables in organizational research: A qualitative analysis with recommendations. Organizational Research Methods, 8, 274-289. [Abstract]

In psychological studies, it is common to include variables as "control variables," for example, age, gender, educational level, and other similar variables, usually demographic variables but sometimes other variables specific to a particular research context. The researchers believe that the effects of these variables should be "controlled for" before investigating the predictive power of the variables of interest. This practice is so common that sometimes we (including me) just think it should be done, without really asking ourselves why.

Becker reviewed a random sample of 60 articles from four top journals, and summarized the problems found in the common practice of using control variables. Among the various problems highlighted, I think the most important one is the lack of explanation. If the control variables are correlated with the proposed predictors but are entered first, we are letting the control variables "claim" the predictive power shared by the control variables and the predictors. The question is whether attributing this shared effect to the control variables is theoretically justified.

For example, consider this hypothetical situation. Assume we have two predictors of salary, intelligence and educational level attained. Usually, we will include educational level as a control variable. However, what if educational level is actually (at least partly) influenced by intelligence, that is, the higher the intelligence, the higher the educational level attained? If this is the case, then the path is intelligence->educational level->salary. According to the view on hierarchical regression analysis by Cohen, Cohen, West, and Aiken (2003), predictors entered in a subsequent step must not be a cause of the variables entered in previous steps. If we adopt this perspective, intelligence should be entered first, not educational level.

Alternatively, we can understand this hypothetical case as a mediation model. The R-square in the first step, with intelligence only, is the total effect of intelligence (direct effect on salary plus indirect effect through educational level). If educational level is actually influenced by intelligence but is entered first, the R-square in the first step is not the total effect of educational level, as it is confounded by the effect of intelligence.
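The effect of the order of entry can be sketched with a small simulation. The following Python code is only an illustration under assumptions of my own: the path coefficients (.6, .3, .4), the sample size, and the seed are arbitrary, and the variable names simply mirror the hypothetical example. It generates data from intelligence->educational level->salary (plus a direct effect of intelligence), and compares the R-square credited to intelligence when it is entered first with its R-square change when it is entered after educational level.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r2_two_predictors(r_y1, r_y2, r_12):
    """R-square of y regressed on two predictors, from the three correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical population model (all coefficients are arbitrary assumptions).
rng = random.Random(0)
n = 5000
intelligence = [rng.gauss(0, 1) for _ in range(n)]
education = [0.6 * iq + rng.gauss(0, 0.8) for iq in intelligence]
salary = [0.3 * iq + 0.4 * ed + rng.gauss(0, 1)
          for iq, ed in zip(intelligence, education)]

r_si = pearson(salary, intelligence)
r_se = pearson(salary, education)
r_ie = pearson(intelligence, education)

r2_both = r2_two_predictors(r_si, r_se, r_ie)
r2_intelligence_alone = r_si ** 2   # intelligence entered first
r2_education_alone = r_se ** 2      # educational level entered first
gain_intelligence_entered_second = r2_both - r2_education_alone
gain_education_entered_second = r2_both - r2_intelligence_alone

print(f"Intelligence first:  R2 = {r2_intelligence_alone:.3f}, "
      f"then education adds {gain_education_entered_second:.3f}")
print(f"Education first:     R2 = {r2_education_alone:.3f}, "
      f"then intelligence adds {gain_intelligence_entered_second:.3f}")
```

The total R-square is the same in both orders; only the way the shared part is credited changes. When educational level is entered first, the contribution left for intelligence is much smaller than its R-square when it is entered first.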

Certainly, Cohen et al.'s perspective is not the only one on using hierarchical regression analysis. For example, sometimes we need to enter several variables first because they are well-established predictors, and we need to demonstrate that the predictors we propose make an additional contribution beyond the well-established predictors. In this case, we are not asserting a causal order for the well-established predictors and the predictors we propose. The order is more a practical one, partly determined by what predictors happen to be studied first.

Nevertheless, Becker reminds us that the inclusion of control variables and the order of entry, like all other variables in the regression analysis, should be justified theoretically.

However, the normative pressure is strong, even in the research community. Anyway, I think it does not hurt if all we need is to think more. :)

Reference

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. NJ: Lawrence Erlbaum Associates.

CiteULike

One frequent task in doing research is searching for journal articles. In the old days without electronic copies and easy-to-use database applications, we relied on file cabinets, hanging folders, index cards, etc. to file the numerous articles that accumulate. Recently, I found a free (as of 2008-06-30) online service that is very useful in the internet era: CiteULike. According to the FAQ of the website, "CiteULike is a free service to help you to store, organise and share the scholarly papers you are reading." I am a new user, and have not yet decided whether I will use it as my main platform to file articles in the future. Nevertheless, I would like to share my initial experience here, so other visitors may try it and see if it is good for them or not. As the FAQ from the website is very detailed, instead of repeating the information here, I would like to share how I am using the service.

Nowadays, nearly all major academic databases are accessible online with a subscription. Moreover, most major publishers have their own websites and online tables of contents for their journals. With CiteULike, when I find a useful article in the web browser, I can post it to my CiteULike personal database (library), and tag it with my own keywords. (For users of del.icio.us, CiteULike is del.icio.us for journal articles.) If the database or publisher website supports CiteULike, then the basic information, such as title, abstract, authors and journal, will be entered automatically. Even if the website does not support CiteULike, it is not very difficult to enter (copy and paste) the information online manually. I can then search my library for articles using the keywords. If several articles are related to the same project, I may also create a keyword for this project and tag all articles related to this project with the project keyword, creating a bibliography for this particular project.

Another important aspect of CiteULike is social. I can find articles other users posted, and may also find other users who have posted an article that I posted to my library. In my search, I have successfully found some articles useful to my projects by picking the articles in my library that are critical to a particular topic and reading the libraries of other users who also posted them. I would otherwise have missed those articles, as they are related to the topic but not in my discipline. Conversely, I can also share my library, and let other people find articles I collected for a topic.

Despite the social aspect of CiteULike, a user can, for whatever reason, mark an article as private. It is still in the personal library, but only the user will see this entry. Therefore, even if a user does not like the idea of letting others know what articles are being cataloged, CiteULike still serves very well as an online private "secret" library. For me, I will use CiteULike as both a private and a public database, as occasionally there are technical reasons that some lists are for private use, or not yet ready for public sharing.

There are certainly other well-developed alternatives, such as EndNote and RefWorks. I myself also have been using a self-created Microsoft Access database for nearly ten years, and likely will keep using it. Nevertheless, CiteULike is easy to use, accessible anywhere, and free. Even my self-created Access database is not really free, as I need to write and debug the code myself, which can be quite time consuming.

I think I will still use my personal Access database for a long time due to the number of entries I have accumulated over the years. For CiteULike, I will use it mainly for sharing articles that I collected for selected topics. That is why the number of articles in my library accumulated in the last few months is very small. I will continue to try CiteULike for several months, to see if it really suits my needs, and whether I will migrate to CiteULike and use it as my main platform for cataloging articles.

High Cronbach Alpha Supports a One-Factor Structure? Probably, but Not Necessarily

In my experience, some students believe that if the Cronbach's alpha of a scale is high, then we can say that the scale measures one factor. In this short post, I will illustrate that a high alpha is not sufficient to support a one-factor structure, without going into the theory behind it. For readers who are interested in knowing more, I strongly recommend the paper by Jose M. Cortina (1993), What is coefficient alpha? An examination of theory and applications, published in Journal of Applied Psychology. Most of the ideas here are based on Cortina's article and related previous papers.

High Cronbach's Alpha: Dataset 1

Dataset 1 (CSV, Excel) is a sample of simulated scores with 1000 cases, 14 variables (X1 to X14). Let's assume they are scores on 14 items, and we suspect they measure the same construct. The Cronbach's alpha for this 14-item scale is .79. Looks good, right? This is the table of factor loadings for these 14 items (principal axis factoring, one-factor requested):

The one-factor structure seems to be acceptable, though most of the primary loadings are only around .40.


Item    Factor 1
X1        .45
X2        .41
X3        .46
X4        .47
X5        .49
X6        .46
X7        .37
X8        .48
X9        .45
X10       .49
X11       .46
X12       .50
X13       .46
X14       .45

(In this and following examples, we will focus on the factor loadings only, although in practice we will also examine other information, such as eigenvalues and percentage of variance explained.)

Let's examine the two-correlated-factor structure using principal axis factoring and direct oblimin (delta=0). This is the loading (pattern) table:

The estimated correlation between Factor 1 and Factor 2 is .26. It seems that a two-factor structure with correlated factors also fits the data. The higher primary loadings (around .60 vs. around .40) and the substantially larger percentage of variance explained (not reported here) actually suggest that the two-correlated-factor structure is more plausible than the one-factor structure (the two-correlated-factor model is indeed the population model used to generate the 1000 cases of random data).


Item    Factor 1    Factor 2
X1        .02         .57
X2       -.02         .57
X3       -.01         .62
X4        .00         .62
X5        .03         .61
X6        .03         .57
X7       -.04         .52
X8        .60         .00
X9        .60        -.03
X10       .57         .04
X11       .59         .00
X12       .62         .02
X13       .56         .01
X14       .60        -.03

The Cronbach's alpha of X1 to X7 is .78, and that of X8 to X14 is .79. Note that these two subsets have only half the number of items of the whole scale (7 items vs. 14 items), but with approximately the same Cronbach's alpha.

So, this example illustrates that it is possible to have high Cronbach's alpha even when the items measure two different, though correlated, constructs.
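The arithmetic behind this result can be sketched in Python. The code below is an illustration under my own simplifying assumptions, not the model actually used to generate Dataset 1: every item is given a loading of .60 on its own factor and the factor correlation is set to .26, values read off the estimates reported above. It builds the implied item correlation matrix and computes Cronbach's alpha (for standardized items) from it.

```python
def implied_correlations(loadings, factor_of, phi):
    """Item correlation matrix implied by a simple-structure factor model.

    loadings[i]  : loading of item i on its own factor
    factor_of[i] : index of the factor that item i belongs to
    phi          : correlation between any two different factors
    """
    k = len(loadings)
    matrix = []
    for i in range(k):
        row = []
        for j in range(k):
            if i == j:
                row.append(1.0)
            else:
                same_factor = factor_of[i] == factor_of[j]
                row.append(loadings[i] * loadings[j] * (1.0 if same_factor else phi))
        matrix.append(row)
    return matrix


def alpha_from_matrix(matrix):
    """Cronbach's alpha from a covariance (here: correlation) matrix."""
    k = len(matrix)
    sum_item_variances = sum(matrix[i][i] for i in range(k))
    total_variance = sum(sum(row) for row in matrix)
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Assumed values: 7 items per factor, loadings .60, factor correlation .26.
full_scale = implied_correlations([0.60] * 14, [0] * 7 + [1] * 7, phi=0.26)
half_scale = implied_correlations([0.60] * 7, [0] * 7, phi=0.26)

alpha_14_items = alpha_from_matrix(full_scale)
alpha_7_items = alpha_from_matrix(half_scale)
print(f"14 items, two correlated factors: alpha = {alpha_14_items:.3f}")
print(f" 7 items, one factor:             alpha = {alpha_7_items:.3f}")
```

Under these assumed values, both alphas come out at roughly .79 to .80, in line with the sample values reported above: doubling the number of items compensates for the weaker correlations between items from different factors.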

High Cronbach's Alpha: Dataset 2

Dataset 2 (CSV, Excel) has 1000 cases of simulated scores, 14 variables (X1 to X14). Again, assume we believe the 14 items measure a single construct. The Cronbach's alpha for this 14-item scale is .85. It looks good, and is even higher than that in Dataset 1. Should we stop here and claim that the 14-item scale measures one single construct?

This is the table of factor loadings for these 14 items (principal axis factoring, one-factor requested):

The pattern is strange. It does not suggest one single factor, despite an alpha higher than that in Dataset 1.


Item    Factor 1
X1       -.18
X2       -.18
X3       -.21
X4       -.20
X5       -.18
X6       -.16
X7       -.16
X8        .79
X9        .79
X10       .78
X11       .80
X12       .76
X13       .78
X14       .78

How about a two-correlated-factor structure? This is the loading (pattern) table (Principal axis factoring, direct oblimin, delta=0):

Each item only has one primary loading, and the correlation between Factor 1 and Factor 2 is small (-.03). The factor analysis results actually suggest a two-factor structure.


Item    Factor 1    Factor 2
X1        .00         .78
X2        .00         .79
X3       -.04         .80
X4       -.02         .81
X5        .00         .80
X6        .03         .81
X7        .02         .79
X8        .80        -.03
X9        .81         .00
X10       .80         .00
X11       .82         .01
X12       .80         .05
X13       .79        -.02
X14       .80        -.01

The Cronbach's alpha of X1 to X7 (Factor 2 items) is .92, and that of X8 to X14 (Factor 1 items) is .93, both higher than the Cronbach's alpha of X1 to X14 (.85). Interestingly, the two factors seem to be uncorrelated!

So, this example illustrates that it is possible to have high Cronbach's alpha even when the items measure two uncorrelated constructs.

Low Cronbach's Alpha: Dataset 3

Dataset 3 (CSV, Excel) is a sample of simulated scores with 1000 cases, 14 variables (X1 to X14). Let's assume they are scores on 14 items. We focus only on X1 to X7 first, and we suspect they measure the same construct. The Cronbach's alpha for this 7-item scale is .58. Looks unsatisfactory, right? This is the table of factor loadings for these 7 items (principal axis factoring, one-factor requested):

Only four loadings are .40 or higher. The seven items do not seem to have one common factor ... do they?

 

Item    Factor 1
X1        .42
X2        .37
X3        .39
X4        .49
X5        .40
X6        .31
X7        .46

In this example, assume a researcher, despite the low Cronbach's alpha, strongly believes that all seven items measure one single factor. The researcher conducts a confirmatory factor analysis, and this is the result (from LISREL):

The one-factor model fits very well! The chi-square test is nonsignificant (remember the sample size is 1000). That is, based on the common standards for confirmatory factor analysis, we fail to reject the one-factor model. This is natural, because the data is indeed simulated from a one-factor model, all loadings set to .40.

So, this example illustrates that it is possible to have low Cronbach's alpha even when the items indeed have only one underlying construct.

Note that this does not mean a low Cronbach's alpha is not a problem. This example just shows that a low Cronbach's alpha does not provide direct support that the items do not measure one single construct. Low Cronbach's alpha, in this case, tells us that the proportion of shared variance is small, or alternatively, the proportion of error variance is large, had we combined the item scores to form a scale score and measure the underlying factor. In other words, we are using the correct measure, although a not so reliable one.

So?

Strictly speaking, Cronbach's alpha is a characteristic of the scale score. It is a measure of the proportion of true score variance in the total score variance. It is not a measure of dimensionality. As illustrated above, high Cronbach's alpha is neither a sufficient nor a necessary condition for a one-factor structure.
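For readers who want to try this themselves, the two extreme cases can be re-created with a few lines of Python. This is a rough sketch and not the procedure used to create the datasets in this post: the loadings (.40 and .80), the seed, and the function names are my own assumptions. Alpha is computed directly from raw scores, using the item variances and the variance of the total score.

```python
import random
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha from raw scores (one row per case, one column per item)."""
    k = len(rows[0])
    item_variances = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def simulate(n, loading, items_per_factor, n_factors, rng):
    """Standardized item scores from uncorrelated factors with equal loadings."""
    unique_sd = (1 - loading ** 2) ** 0.5
    rows = []
    for _ in range(n):
        row = []
        for _ in range(n_factors):
            factor_score = rng.gauss(0, 1)
            row.extend(loading * factor_score + unique_sd * rng.gauss(0, 1)
                       for _ in range(items_per_factor))
        rows.append(row)
    return rows


rng = random.Random(1)

# One factor, 7 items, weak loadings (assumed .40): one construct, low alpha.
one_factor = simulate(1000, 0.40, 7, 1, rng)
alpha_one_factor = cronbach_alpha(one_factor)

# Two uncorrelated factors, 7 items each, strong loadings (assumed .80).
two_factors = simulate(1000, 0.80, 7, 2, rng)
alpha_two_factors = cronbach_alpha(two_factors)

print(f"One factor, loadings .40:               alpha = {alpha_one_factor:.2f}")
print(f"Two uncorrelated factors, loadings .80: alpha = {alpha_two_factors:.2f}")
```

With these settings, the single-factor scale should give an alpha somewhere below .70, while the 14 items measuring two unrelated constructs should give an alpha above .80. Exact values depend on the random seed.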

That said, I must admit that sometimes I may also implicitly (and incorrectly) suggest a one-factor structure based on a high Cronbach's alpha. Nevertheless, if possible, we should use the various factor analysis techniques to examine the underlying structure. Cronbach's alpha is used to assess the reliability of a scale score, not the underlying structure.

References

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.

Other references on reliability I posted to my CiteULike library

 

Plot the t-Distribution

A simple (but long) formula I use in Excel and OpenOffice Calc to plot the t-distribution.

Recently, for illustration, I needed to plot the t-distribution. However, I failed to find a built-in function in Excel or OpenOffice Calc for this purpose. They have built-in functions for the cumulative distribution function of the t-distribution, but not for its density function. (I would like to be corrected, if I am wrong.^_^) One solution is to type the probability density function into the spreadsheet myself:

\[ f(x) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}} \]

where x is the variable having a t-distribution, and v is the number of degrees of freedom. Excel and OpenOffice Calc have the natural logarithm of the gamma function (GAMMALN). Therefore, a simple rearrangement gives the natural logarithm of the t-distribution density function:

\[ \ln f(x) = \ln\Gamma\left(\frac{v+1}{2}\right) + \ln\left[\left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}\right] - \ln\sqrt{v\pi} - \ln\Gamma\left(\frac{v}{2}\right) \]

and then get the density from the exponential:

\[ f(x) = \exp\left(\ln f(x)\right) \]

This is the formula in OpenOffice Calc (and should also work in Excel):

=EXP(GAMMALN((B$1+1)/2)+LN((1+(A2^2)/B$1)^((B$1+1)/(-2)))-LN(SQRT(B$1*PI()))-GAMMALN(B$1/2))

B$1 is the cell with the number of degrees of freedom (v), and A2 is the value (x) at which the density is to be computed. It is "B$1" because usually I will copy the formula to a column of cells, and the row reference of the number of degrees of freedom should be locked in the process. Note that the formula can actually be further simplified, and we can see that only one term involves x, and this term does not involve the gamma function. Nevertheless, I deliberately keep the current form, to keep this formula close to the original formula. Here is a screenshot of an OpenOffice Calc spreadsheet I created, with a t-distribution (df=3), and a standard normal distribution for comparison:

To be honest, I am not very comfortable with this approach, as I am not sure if there will be serious error of approximation in the underlying algorithm of GAMMALN when used this way, especially for extreme values of x. Anyway, I compared the results for x = -5 to +5, in steps of .01, at several different numbers of degrees of freedom, with results from SPSS's built-in probability density function (PDF.T), and found no discrepancy at the eighth decimal place. In my case, it is good enough, as all I need is to plot the distribution for illustration. No need for high precision.
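The same rearrangement can also be cross-checked outside the spreadsheet. Below is a small Python sketch that mirrors the spreadsheet formula term by term, with math.lgamma playing the role of GAMMALN. The function name t_density is my own. It is compared with two cases where the t density has a simple closed form: df = 1 (the Cauchy distribution, 1/(pi(1+x^2))) and df = 3 at x = 0 (2/(pi*sqrt(3))).

```python
from math import exp, lgamma, log, pi, sqrt


def t_density(x, v):
    """Density of the t-distribution with v degrees of freedom at x.

    Mirrors the spreadsheet formula term by term; lgamma is the natural
    logarithm of the gamma function (GAMMALN in the spreadsheet).
    Like the spreadsheet version, the second term may fail for very
    extreme x, where the power underflows to zero before the log is taken.
    """
    log_density = (lgamma((v + 1) / 2)
                   + log((1 + x ** 2 / v) ** (-(v + 1) / 2))
                   - log(sqrt(v * pi))
                   - lgamma(v / 2))
    return exp(log_density)


# Compare with closed forms for df = 1 (Cauchy) and df = 3 at x = 0.
print(t_density(0, 1), 1 / pi)
print(t_density(2, 1), 1 / (pi * (1 + 2 ** 2)))
print(t_density(0, 3), 2 / (pi * sqrt(3)))

# Values for a plot like the one described above: x from -5 to 5 in steps of .01.
xs = [i / 100 for i in range(-500, 501)]
densities_df3 = [t_density(x, 3) for x in xs]
```

The pairs printed should agree to many decimal places. For plotting, the lists xs and densities_df3 can be passed to any charting tool.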

That's it. The obvious problem of the formula is its length. The more we type, the more likely we are to make a typo. If anybody knows an easier way to plot the t-distribution in Excel or OpenOffice Calc, e.g., a built-in function that I missed, :p, please let me know.

Files for Illustration: Excel / OpenOffice.org Calc

Note: The sign for the gamma function is in italics. I don't know how to use upright (non-italic) type for Greek characters in OpenOffice Math. If anybody knows how to do it, I would like to know.