The Datasets Package

statsmodels provides datasets (data and metadata) for use in examples, tutorials, model testing, and similar tasks.

Using Datasets from Stata

`webuse`(data[, baseurl, as_df])	Download and return an example dataset from Stata.

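Using Datasets from R

The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available from statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example: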

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")

In [3]: print(duncan_prestige.__doc__)
.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
rownames                                     
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

R Datasets Function Reference

`get_rdataset`(dataname[, package, cache])	Download and return an R dataset.
`get_data_home`([data_home])	Return the path of the statsmodels data directory.
`clear_data_home`([data_home])	Delete all the content of the data home cache.

Available Datasets

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load_pandas()

The Dataset object follows the bunch pattern. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0
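
The bunch pattern mentioned above can be sketched in a few lines of plain Python: a dictionary whose keys are also reachable as attributes. The class name `Bunch` and the toy values are assumptions made for illustration; this is not the statsmodels Dataset implementation.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Store attribute writes as dictionary entries.
        self[name] = value


# Toy container with made-up values, mimicking the data/names layout.
toy = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], names=["y", "x"])
toy.endog_name = "y"  # attribute write becomes a dictionary key

print(toy.names, toy["endog_name"])
```

The appeal of the pattern is that a loader can attach any number of named pieces (data, endog, exog, names) without defining a dedicated class for each dataset.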

Most datasets hold convenient representations of the data in the endog and exog attributes:

In [8]: data.endog.iloc[:5]
Out[8]: 
0    60323.0
1    61122.0
2    60171.0
3    61187.0
4    63221.0
Name: TOTEMP, dtype: float64

In [9]: data.exog.iloc[:5,:]
Out[9]: 
   GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4     96.2  328975.0  2099.0  3099.0  112075.0  1951.0

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

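If a dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. As the output below shows for a dataset loaded with load_pandas, both data and raw_data are pandas DataFrames, and the names attribute gives the column names.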

In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame

In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

Loading data as pandas objects

For many users it may be preferable to get the datasets as pandas DataFrame or Series objects. Each dataset module provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object.

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Because pandas is integrated into the estimation classes, the metadata (such as variable names) is attached to the model results:

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params
Out[21]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [22]: res.summary()
Out[22]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TOTEMP   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          5.052e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):                    8.20e-22
Time:                        16:08:41   Log-Likelihood:                         -117.56
No. Observations:                  16   AIC:                                      247.1
Df Residuals:                      10   BIC:                                      251.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691    -341.638     235.650
GNP            0.0711      0.030      2.356      0.040       0.004       0.138
UNEMP         -0.4235      0.418     -1.014      0.335      -1.354       0.507
ARMED         -0.5726      0.279     -2.052      0.067      -1.194       0.049
POP           -0.4142      0.321     -1.289      0.226      -1.130       0.302
YEAR          48.4179     17.689      2.737      0.021       9.003      87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Extra Information

If you want to know more about a dataset itself, you can access the following, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional Information

The idea for a datasets package was originally proposed by David Cournapeau.
To add datasets, see the notes on adding a dataset.

Last update: Oct 03, 2024