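Today's retrospective
- Fact: Finished the bike exercise; house price prediction exercise (feature engineering practice)
- Feeling: Exponentials and logs are becoming familiar through repetition. Practicing several times on the bike data, which needs less background knowledge than the diabetes or Titanic data, helped me understand feature selection better.
- Finding: For now, simply reviewing the lessons properly seems to be the best way to study.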
.dt accessor
Day-of-week conversion exercise: see the prescription history and COVID-19 analysis exercises
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.day
train["hour"] = train["datetime"].dt.hour
train["minute"] = train["datetime"].dt.minute
train["second"] = train["datetime"].dt.second
train["dayofweek"] = train["datetime"].dt.dayofweek  # day of week (Monday=0, Sunday=6)
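The .dt accessor only works once the column is a datetime64 column, so a string column has to be parsed first. A minimal sketch, assuming a small made-up frame in place of the real bike train data (the three timestamps and the day_name column are illustrative, not from the exercise):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the bike train data.
train = pd.DataFrame(
    {"datetime": ["2011-01-01 00:00:00", "2011-01-03 13:30:00", "2012-12-19 23:00:00"]}
)

# Parse the strings; without this step .dt raises an AttributeError.
train["datetime"] = pd.to_datetime(train["datetime"])

train["hour"] = train["datetime"].dt.hour
train["dayofweek"] = train["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
train["day_name"] = train["datetime"].dt.day_name()  # readable label for checking

print(train)
```

Printing day_name next to dayofweek is a quick way to confirm which integer maps to which day.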
Ordinal encoding
# one-hot encoding => pd.get_dummies(); used for data that has no order
# ordinal encoding => converting to the category dtype gives integer codes through .cat.codes
# ordinal encoding
train["year_month_code"] = train["year_month"].astype("category").cat.codes
test["year_month_code"] = test["year_month"].astype("category").cat.codes
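One thing to watch: calling astype("category") on train and test separately numbers the categories per frame, so the codes only line up when both frames contain exactly the same values. That is probably the case for year_month in the bike data, but it is not guaranteed in general. A sketch of one way to share the mapping, using made-up frames and a hypothetical shared dtype:

```python
import pandas as pd

# Hypothetical frames; test is missing the first month on purpose.
train = pd.DataFrame({"year_month": ["2011-01", "2011-02", "2011-03"]})
test = pd.DataFrame({"year_month": ["2011-02", "2011-03"]})

# Encoding test on its own restarts the numbering at 0.
separate_codes = test["year_month"].astype("category").cat.codes.tolist()

# Build one ordered category list from both frames and reuse it.
categories = sorted(set(train["year_month"]) | set(test["year_month"]))
shared_dtype = pd.CategoricalDtype(categories=categories, ordered=True)

train["year_month_code"] = train["year_month"].astype(shared_dtype).cat.codes
test["year_month_code"] = test["year_month"].astype(shared_dtype).cat.codes

print(separate_codes)                    # codes when test is encoded alone
print(test["year_month_code"].tolist())  # codes with the shared mapping
```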
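Why apply log
- Applying log to count turns a distribution that is sharply peaked on one side into a smoother shape, closer to a normal distribution
- Values on a log scale are also less sensitive to outliers
- A roughly normal target tends to work better because, when the values are heavily skewed and peaked, the model has trouble learning the sparse tail; with a more symmetric distribution it can learn across the whole range more evenly.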
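The effect on the shape of the distribution can be checked with skewness. This is only a sketch on synthetic right-skewed data (a lognormal sample with an arbitrary seed), not the actual count column:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a count-like target.
rng = np.random.default_rng(42)
count = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = count.skew()
skew_after = np.log1p(count).skew()

# A skewness near 0 means the distribution is roughly symmetric.
print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after log1p:  {skew_after:.2f}")
```

On the real data, plotting both versions with a histogram shows the same thing visually.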
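Log and exponential functions
train["count_log1p"] = np.log(train["count"] + 1)
train["count_expm1"] = np.exp(train["count_log1p"]) - 1
np.exp is the exponential function and np.log is the natural logarithm.
The log step added 1 and then took the log, so the inverse has to undo those steps in reverse order.
That means applying np.exp first and then subtracting 1.
np.expm1 does exactly this (exponential, then minus 1), and np.log1p is the matching "add 1, then log".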
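A quick round-trip check of the order described above, on a small made-up array rather than the real count column:

```python
import numpy as np

# Made-up counts, including 0 to show why the +1 is needed (log(0) is -inf).
count = np.array([0, 1, 16, 977], dtype=float)

# Forward: add 1, then log.
count_log1p = np.log1p(count)

# Inverse: exp, then subtract 1.
restored = np.expm1(count_log1p)

# The manual version follows the same order and gives the same result.
manual = np.exp(np.log(count + 1)) - 1

print(restored)
print(np.allclose(restored, count), np.allclose(manual, count))
```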
House price exercise
House Prices - Advanced Regression Techniques
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
Evaluation
Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
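The parenthetical about expensive and cheap houses can be seen with a small calculation. The helper name rmse_of_logs and the prices below are made up for illustration; this is a sketch of the metric as described, not Kaggle's scoring code:

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Both predictions are 10% too high, on a cheap and an expensive house.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])

# In dollars the errors are 10,000 vs 100,000, but on the log scale
# they are the same size because only the ratio matters.
print(cheap, expensive)
```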
Data fields
Here's a brief version of what you'll find in the data description file.
- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
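The variables available for predicting house prices in this dataset include interior and exterior quality, number of bathrooms, number of rooms, whether there is a pool, the roof, and when the house was built. I will run EDA on this dataset and then go through feature engineering.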
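Feature engineering
Why outliers hurt training: the model fits the outliers as well, which risks overfitting
If the target (label) in train contains outliers, how should they be handled?
- Remove those rows from train
- Apply scaling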
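The two options can be sketched as follows. Both the 1.5 x IQR cutoff and reading "scaling" as a log transform are assumptions made for this example, and the prices are made up:

```python
import numpy as np
import pandas as pd

# Made-up target values with one very expensive house.
train = pd.DataFrame(
    {"SalePrice": [120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 755_000]}
)

# Option 1: drop rows whose target is above the upper IQR fence (assumed rule).
q1 = train["SalePrice"].quantile(0.25)
q3 = train["SalePrice"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
train_trimmed = train[train["SalePrice"] <= upper_fence]

# Option 2: keep every row but compress the scale with log1p.
train["SalePrice_log1p"] = np.log1p(train["SalePrice"])

print(len(train), len(train_trimmed))
print(train["SalePrice_log1p"].round(2).tolist())
```

Only train rows are dropped in option 1; test rows have to stay because every Id needs a prediction.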
Handling rare values
- Group low-frequency categories into an "other" category
- Treat rare values as missing
- Handling rare values helps prevent overfitting when one-hot encoding
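A sketch of the first option, grouping rare categories before one-hot encoding. The series values, the min_count threshold, and the "Other" label are all chosen for illustration:

```python
import pandas as pd

# Made-up categorical column with two values that appear only once.
roof = pd.Series(
    ["CompShg"] * 8 + ["Tar&Grv"] * 2 + ["Metal", "Roll"], name="RoofMatl"
)

min_count = 2  # hypothetical threshold for "rare"
counts = roof.value_counts()
rare_values = counts[counts < min_count].index

# Keep frequent values, replace rare ones with a single "Other" label.
roof_grouped = roof.where(~roof.isin(rare_values), "Other")

# Fewer one-hot columns means fewer near-empty features to overfit on.
n_cols_before = pd.get_dummies(roof).shape[1]
n_cols_after = pd.get_dummies(roof_grouped).shape[1]
print(n_cols_before, n_cols_after)
```

For the second option, replacing "Other" with np.nan makes pd.get_dummies skip those rows by default, so they get all-zero columns.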
Checking missing values
test_na = test.isnull().sum()
train_sum = test_na[test_na > 0].sort_values(ascending=False)
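The same check on a tiny made-up frame, with the missing ratio added since counts alone are hard to compare across frames of different sizes. The column values are invented; only the column names come from the data description:

```python
import numpy as np
import pandas as pd

# Made-up four-row frame standing in for the house price test data.
test = pd.DataFrame(
    {
        "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
        "LotFrontage": [65.0, np.nan, 80.0, 70.0],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# Count missing values per column, keep only columns that have any.
test_na = test.isnull().sum()
test_na = test_na[test_na > 0].sort_values(ascending=False)

# The mean of a boolean mask is the fraction of missing rows.
test_na_ratio = test.isnull().mean()

print(test_na)
print(test_na_ratio)
```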