KI: PRACTICUM 13 - AI Security Final Project

From OnnoWiki

In this practicum you are no longer just “trying out tools”. You will build a mini security product that:

  • has clearly defined input data
  • has a detection process
  • produces a decision as output (risk score + reasons)
  • comes with a report and demo you can justify and defend

The key: explain the security logic. AI only assists.

Project Options (Choose 1)

  • AI Phishing Detector (the most “real-world”, easy to test)
  • AI PDP Audit (privacy compliance under PDP, Pelindungan Data Pribadi; suited to logs/CSV/datasets)
  • Simple AI IDS (network logs/anomaly detection, challenging but fun)

All projects share the same end-to-end framework.
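
The shared end-to-end shape (input → detection → decision with risk score and reasons) can be sketched roughly as below. This is only an illustrative skeleton: the names Decision and detect, and the toy keyword rule, are hypothetical placeholders and not part of any of the three project scripts.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Output every project should produce: a label, a risk score, and reasons."""
    label: str
    risk_score: float            # 0.0 (safe) .. 1.0 (risky)
    reasons: list = field(default_factory=list)


def detect(text: str) -> Decision:
    """Placeholder detection step: a toy keyword rule standing in for a real model."""
    keywords = ["urgent", "verify", "password"]   # hypothetical rule list
    hits = [k for k in keywords if k in text.lower()]
    score = len(hits) / len(keywords)
    label = "SUSPICIOUS" if score >= 0.5 else "OK"
    return Decision(label=label, risk_score=score, reasons=hits)


if __name__ == "__main__":
    d = detect("URGENT: verify your account")
    print(d.label, round(d.risk_score, 2), d.reasons)
```

Whatever project you pick, keeping the decision in one small structure like this makes it easier to show the score and the reasons side by side in the demo.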

Mandatory Project Structure (Same for all)

Stage 0 - Environment Setup (Ubuntu 24.04)

Install base dependencies

sudo apt update
sudo apt install -y python3 python3-venv python3-pip git gpg
python3 --version
gpg --version

Create the project folder

mkdir -p ~/ai-security-final/{data,src,reports,models}
cd ~/ai-security-final
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

Python packages (open-source)

For all project options:

pip install pandas scikit-learn numpy joblib rich

Optional (if you need tidier log parsing, e.g. flexible timestamp parsing):

pip install python-dateutil

Minimal folder structure:

ai-security-final/
  data/
  src/
  models/
  reports/
  README.md

Stage 1 - Project Data Security (Mandatory): GnuPG for Datasets & Outputs

Why? Because datasets and reports often contain sensitive data. You must demonstrate that you can secure that data.

1. Create a GPG key (for the project)

gpg --full-generate-key

Choose:

  • (1) RSA and RSA
  • 3072 or 4096 (key size in bits)
  • name: AI Security Student
  • email: student@lab.local

Check the key:

gpg --list-keys

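2. Encrypt the dataset (example)

Suppose your dataset is data/phishing_samples.csv. The command below uses --symmetric, which protects the file with a passphrase and does not use the key pair from step 1; to encrypt to that key instead, use gpg --encrypt --recipient student@lab.local. After encrypting, shred removes the plaintext copy: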

gpg --output data/phishing_samples.csv.gpg --symmetric --cipher-algo AES256 data/phishing_samples.csv

shred -u data/phishing_samples.csv

Decrypt when needed:

gpg --output data/phishing_samples.csv --decrypt data/phishing_samples.csv.gpg

Project rule: any dataset containing personal/risky data must be stored encrypted, or at minimum replaced with dummy data.
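
One way to check this rule before submitting is a small script that lists files in data/ that are not GPG-encrypted. The sketch below is an assumption about how such a check could look: the function name find_unencrypted is hypothetical, and it only looks at the .gpg extension, so dummy files are flagged too and still need your own judgment.

```python
from pathlib import Path


def find_unencrypted(data_dir: str) -> list:
    """Return sorted names of files in data_dir that lack the .gpg extension."""
    root = Path(data_dir)
    if not root.is_dir():
        return []  # nothing to check if the folder does not exist
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix != ".gpg"
    )


if __name__ == "__main__":
    leftovers = find_unencrypted("data")
    if leftovers:
        print("Plaintext files found (confirm they are dummy data):", leftovers)
    else:
        print("OK: no plaintext files in data/")
```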

Stage 2 - Choose a Project + Run It Step by Step

Below are 3 complete project tracks, each with:

  • realistic sample data
  • implementation steps
  • training + inference code
  • demo output
  • report format

Just pick one.

OPTION A - AI Phishing Detector (Recommended)

Goal: detect phishing messages from email/chat text → output a label + risk score + reasons.

1. Prepare the dataset (realistic but safe)

Create the file data/phishing_samples.csv (a mini example; you can extend it):

text,label
"URGENT: Your account will be suspended. Verify now at http://secure-login.example.com",1
"Hi team, meeting moved to 3pm. Link: https://meet.example.org/abc",0
"Reset password now. Your mailbox is full. Click http://mailbox-reset.example.net",1
"Invoice attached, please review. Thanks",0
"Bank: unusual activity detected. Confirm your OTP at http://bank-verify.example.xyz",1
"Reminder: submit assignment before Friday",0

Labels: 1 = phishing, 0 = benign.

Challenge: later, add 50–200 examples (these can be realistic texts you write yourself).
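
As the dataset grows, it helps to validate it before training. The following is a minimal sketch, assuming the same text,label columns as above; the helper name check_dataset is hypothetical. It rejects empty texts and labels other than 0 or 1, and returns how many rows each label has so you can spot a badly imbalanced dataset.

```python
import csv
import io
from collections import Counter


def check_dataset(csv_text: str) -> Counter:
    """Validate rows of a text,label CSV and return the count of each label."""
    counts = Counter()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if not text:
            raise ValueError(f"line {reader.line_num}: empty text")
        if label not in {"0", "1"}:
            raise ValueError(
                f"line {reader.line_num}: label must be 0 or 1, got {label!r}"
            )
        counts[int(label)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "text,label\n"
        '"URGENT: Your account will be suspended.",1\n'
        '"Reminder: submit assignment before Friday",0\n'
    )
    print(check_dataset(sample))
```

To check the real file, read it first, for example with open("data/phishing_samples.csv", encoding="utf-8").read(), and pass the text to check_dataset.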

2. Create the training script (simple + explainable ML)

Create src/train_phishing.py:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib

DATA_PATH = "data/phishing_samples.csv"
MODEL_PATH = "models/phishing_model.joblib"

def main():
    df = pd.read_csv(DATA_PATH)
    X = df["text"].astype(str)
    y = df["label"].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=1)),
        ("clf", LogisticRegression(max_iter=200))
    ])

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("=== Confusion Matrix ===")
    print(confusion_matrix(y_test, y_pred))
    print("\n=== Classification Report ===")
    print(classification_report(y_test, y_pred))

    joblib.dump(model, MODEL_PATH)
    print(f"\nSaved model to: {MODEL_PATH}")

if __name__ == "__main__":
    main()

Run:

python src/train_phishing.py
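
To explain the printed confusion matrix in your report, it may help to derive the metrics by hand. The sketch below assumes scikit-learn's layout for labels [0, 1], which is [[TN, FP], [FN, TP]]; the function name phishing_metrics and the example counts are made up for illustration and are not results from the sample dataset.

```python
def phishing_metrics(cm):
    """Compute precision, recall, and F1 for the phishing class (label 1).

    cm uses scikit-learn's layout for labels [0, 1]: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for illustration only
    example_cm = [[8, 2], [1, 9]]
    p, r, f = phishing_metrics(example_cm)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For a phishing detector, FN (a phishing message labeled benign) is usually the costlier error, so recall on label 1 deserves a sentence in the limitations section.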
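
3. Build the detector + reasons (keyword hints)

Create src/detect_phishing.py. The reasons come from a fixed keyword list (SUSPICIOUS_HINTS), not from the model's weights: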

import joblib
from rich import print
from rich.console import Console

MODEL_PATH = "models/phishing_model.joblib"

SUSPICIOUS_HINTS = [
    "urgent", "verify", "reset", "suspended", "otp", "password",
    "click", "confirm", "limited", "account", "bank"
]

def explain_text(text: str):
    low = text.lower()
    hits = [h for h in SUSPICIOUS_HINTS if h in low]
    return hits[:10]

def main():
    model = joblib.load(MODEL_PATH)

    console = Console()
    console.print("[bold]AI Phishing Detector Demo[/bold]")
    console.print("Type a message/email. Press Enter on an empty line to exit.\n")

    while True:
        text = input("Message> ").strip()
        if not text:
            break

        proba = model.predict_proba([text])[0][1]  # prob phishing
        label = "PHISHING" if proba >= 0.5 else "BENIGN"
        hints = explain_text(text)

        print("\n[bold]Result[/bold]")
        print(f"Label     : [bold]{label}[/bold]")
        print(f"Risk score: [bold]{proba:.2f}[/bold] (0..1)")
        print(f"Reasons   : {hints if hints else 'No obvious keyword hints'}")
        print("-" * 60)

if __name__ == "__main__":
    main()

Run the demo:

python src/detect_phishing.py

Realistic sample inputs for the demo:

  • “Admin: your account will be deactivated. Click this link to verify …”
  • “Please check the invoice, there is a .zip file, the password is 12345”
  • “Meeting at 2, Google Meet link …”

You score higher if you add: short-URL detection, suspicious domains, the word “urgent”, and patterns commonly used in scams.
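
The URL-based checks mentioned above could be sketched as follows. Everything here is an assumption for illustration: the function name url_reasons is hypothetical, and the shortener and top-level-domain lists are small examples you would need to extend and justify in your report.

```python
import re
from urllib.parse import urlparse

# Assumed example lists; extend them for your own project
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
SUSPICIOUS_TLDS = (".xyz", ".top", ".zip")

URL_RE = re.compile(r"https?://[^\s\"']+", re.I)
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_reasons(text: str) -> list:
    """Return human-readable reasons for each suspicious-looking URL in text."""
    reasons = []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # skip URLs that urlparse rejects as malformed
        if host in SHORTENERS:
            reasons.append(f"URL shortener: {host}")
        if IPV4_RE.match(host):
            reasons.append(f"raw IP address instead of domain: {host}")
        if host.endswith(SUSPICIOUS_TLDS):
            reasons.append(f"unusual top-level domain: {host}")
    return reasons


if __name__ == "__main__":
    msg = "Bank: confirm your OTP at http://bank-verify.example.xyz"
    print(url_reasons(msg))
```

In detect_phishing.py, these reasons could be appended to the keyword hints so that the output explains both the wording and the link.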

OPTION B - AI PDP Audit (Privacy Audit Tool)

Goal: scan a CSV/log file → detect personal data → produce a risk report + recommendations.

1. Sample dataset

Create data/sample_users.csv (nik is the Indonesian national ID number, masked here with x):

name,email,phone,nik,address,notes
"Budi","budi@mail.com","08123456789","327xxxxxxxxxxxx","Bekasi","token=abc123"
"Siti","siti@gmail.com","082233445566","320xxxxxxxxxxxx","Jakarta","pwd=123456"
"Andi","andi@corp.co.id","081299988877","","Bandung","no issues"

2. Audit tool (regex + scoring)

Create src/pdp_audit.py:

import re
import pandas as pd
from rich import print
from rich.table import Table

PATTERNS = {
    "EMAIL": re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"),
    "PHONE_ID": re.compile(r"\b(08\d{8,12})\b"),
    "NIK_LIKE": re.compile(r"\b\d{16}\b"),
    "TOKEN_LIKE": re.compile(r"\b(token|apikey|secret)\s*=\s*[A-Za-z0-9_-]{6,}\b", re.I),
    "PASSWORD_LIKE": re.compile(r"\b(pass|pwd|password)\s*=\s*\S+\b", re.I),
}

SEVERITY = {
    "EMAIL": 2,
    "PHONE_ID": 2,
    "NIK_LIKE": 4,
    "TOKEN_LIKE": 4,
    "PASSWORD_LIKE": 5,
}

def scan_value(val: str):
    findings = []
    for k, rx in PATTERNS.items():
        if rx.search(val):
            findings.append(k)
    return findings

def main():
    path = "data/sample_users.csv"
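    # dtype=str keeps leading zeros (e.g. phone numbers); otherwise pandas
    # parses "08123456789" as an integer and PHONE_ID never matches
    df = pd.read_csv(path, dtype=str)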

    findings_rows = []
    total_score = 0

    for idx, row in df.iterrows():
        row_findings = []
        row_score = 0
        for col, v in row.items():
            s = "" if pd.isna(v) else str(v)
            hits = scan_value(s)
            for h in hits:
                row_findings.append((col, h))
                row_score += SEVERITY[h]
        total_score += row_score
        findings_rows.append((idx, row_score, row_findings))

    table = Table(title="PDP Audit Report (Quick Scan)")
    table.add_column("Row", justify="right")
    table.add_column("Risk Score", justify="right")
    table.add_column("Findings")

    for idx, score, f in findings_rows:
        pretty = ", ".join([f"{col}:{tag}" for col, tag in f]) if f else "-"
        table.add_row(str(idx), str(score), pretty)

    print(table)
    print(f"\n[bold]Total Risk Score:[/bold] {total_score}")

    if total_score >= 20:
        print("[bold red]High risk:[/bold red] apply masking/encryption and access control immediately.")
    elif total_score >= 10:
        print("[bold yellow]Medium risk:[/bold yellow] audit consent + minimize data.")
    else:
        print("[bold green]Low risk:[/bold green] still verify retention and access logs.")

if __name__ == "__main__":
    main()

Run:

python src/pdp_audit.py
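
When the audit recommends masking, a helper like the one below could be applied to values before they are shown or stored. This is a sketch under assumptions: mask_value is a hypothetical name, and the masking style (keep the first character of the email local part, keep the first four and last two digits of a phone number) is one arbitrary choice, not a requirement of the assignment.

```python
import re

# Keep the first character of the local part and the full domain
EMAIL_RE = re.compile(
    r"\b([a-zA-Z0-9._%+-])[a-zA-Z0-9._%+-]*@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b"
)
# Same 10-14 digit range as the audit's PHONE_ID pattern: keep first 4, last 2
PHONE_RE = re.compile(r"\b(08\d{2})\d{4,8}(\d{2})\b")


def mask_value(val: str) -> str:
    """Mask emails and 08... phone numbers inside val; other text is unchanged."""
    val = EMAIL_RE.sub(r"\1***@\2", val)
    val = PHONE_RE.sub(r"\1******\2", val)
    return val


if __name__ == "__main__":
    for raw in ["budi@mail.com", "08123456789", "no issues"]:
        print(raw, "->", mask_value(raw))
```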

Bonus points: save the audit result to reports/pdp_report.txt and then encrypt it with GPG.
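
A possible shape for that bonus step is sketched below. The helper names format_report, save_report, and gpg_encrypt_command are hypothetical; the input is assumed to be the same (row index, score, findings) tuples that pdp_audit.py collects in findings_rows, and the GPG command simply reuses the symmetric-encryption flags shown earlier in the GnuPG step. The report stores only which column matched which pattern, not the sensitive values themselves.

```python
from pathlib import Path


def format_report(findings_rows) -> str:
    """Render (row_index, score, [(column, tag), ...]) tuples as plain text."""
    lines = ["PDP Audit Report (Quick Scan)"]
    total = 0
    for idx, score, findings in findings_rows:
        pretty = ", ".join(f"{col}:{tag}" for col, tag in findings) or "-"
        lines.append(f"row={idx} score={score} findings={pretty}")
        total += score
    lines.append(f"total_risk_score={total}")
    return "\n".join(lines) + "\n"


def save_report(findings_rows, path: str) -> None:
    """Write the plain-text report, creating the parent folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(format_report(findings_rows), encoding="utf-8")


def gpg_encrypt_command(path: str) -> list:
    """Build the symmetric GPG command for the given file (not executed here)."""
    return ["gpg", "--output", f"{path}.gpg",
            "--symmetric", "--cipher-algo", "AES256", path]


if __name__ == "__main__":
    rows = [(0, 8, [("email", "EMAIL"), ("phone", "PHONE_ID"), ("notes", "TOKEN_LIKE")])]
    print(format_report(rows), end="")
    # In pdp_audit.py you would call: save_report(findings_rows, "reports/pdp_report.txt")
    print("Then run:", " ".join(gpg_encrypt_command("reports/pdp_report.txt")))
```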

OPTION C - Simple AI IDS (Anomaly Detection from Logs)

Goal: read connection logs → detect “odd” behavior (e.g. port scanning / brute force) → alert.

1. Create a simple log dataset (realistic example)

Create data/connections.csv:

src_ip,dst_port,count_per_minute
192.168.1.10,80,5
192.168.1.11,22,3
192.168.1.50,22,60
192.168.1.50,23,55
192.168.1.50,445,40
192.168.1.12,443,4
192.168.1.13,80,6

Interpretation:

  • IP 192.168.1.50 is very “busy” across many ports → suspected scan/brute force
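
That interpretation can be checked directly by aggregating the rows per source IP. The sketch below embeds the sample rows from above; the function name summarize_by_ip and the "3 or more distinct ports" threshold are assumptions chosen for this tiny dataset.

```python
import csv
import io
from collections import defaultdict

CONNECTIONS_CSV = """src_ip,dst_port,count_per_minute
192.168.1.10,80,5
192.168.1.11,22,3
192.168.1.50,22,60
192.168.1.50,23,55
192.168.1.50,445,40
192.168.1.12,443,4
192.168.1.13,80,6
"""


def summarize_by_ip(csv_text: str) -> dict:
    """Return {src_ip: (distinct_ports, total_count_per_minute)}."""
    ports = defaultdict(set)
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ports[row["src_ip"]].add(int(row["dst_port"]))
        totals[row["src_ip"]] += int(row["count_per_minute"])
    return {ip: (len(ports[ip]), totals[ip]) for ip in ports}


if __name__ == "__main__":
    for ip, (n_ports, total) in summarize_by_ip(CONNECTIONS_CSV).items():
        flag = "SUSPECT" if n_ports >= 3 else "ok"  # assumed threshold
        print(f"{ip}: ports={n_ports} total/min={total} {flag}")
```

Only 192.168.1.50 reaches three distinct ports (22, 23, 445) with 155 connections per minute in total, which is the pattern the anomaly model is expected to pick up.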

2. Anomaly detection with IsolationForest

Create src/ids_anomaly.py:

import pandas as pd
from sklearn.ensemble import IsolationForest
from rich import print
from rich.table import Table

def main():
    df = pd.read_csv("data/connections.csv")

    # Simple features: destination port and connection rate
    X = df[["dst_port", "count_per_minute"]].astype(float)

    model = IsolationForest(contamination=0.2, random_state=42)
    df["anomaly"] = model.fit_predict(X)  # -1 = anomaly, 1 = normal
    df["score"] = model.decision_function(X)  # lower = more anomalous

    table = Table(title="AI IDS Sederhana (Anomaly Detection)")
    table.add_column("src_ip")
    table.add_column("dst_port", justify="right")
    table.add_column("count/min", justify="right")
    table.add_column("anomaly")
    table.add_column("score", justify="right")

    for _, r in df.sort_values("score").iterrows():
        tag = "[bold red]ALERT[/bold red]" if r["anomaly"] == -1 else "OK"
        table.add_row(
            str(r["src_ip"]),
            str(int(r["dst_port"])),
            str(int(r["count_per_minute"])),
            tag,
            f"{r['score']:.3f}"
        )

    print(table)

    alerts = df[df["anomaly"] == -1]
    if len(alerts) > 0:
        print("\n[bold]Suggested Investigation Steps:[/bold]")
        print("- Check whether the IP is a normal user or an unknown device")
        print("- Check the auth log (/var/log/auth.log) if port 22 dominates")
        print("- If many different ports are hit: likely port scanning")
    else:
        print("\n[bold green]No anomaly detected[/bold green]")

if __name__ == "__main__":
    main()

Run:

python src/ids_anomaly.py

Bonus points: integrate with real logs from auth.log or ufw.log (without personal data).
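
For auth.log, one starting point is counting failed SSH logins per source IP. The sketch below assumes the usual OpenSSH message wording ("Failed password for ... from IP port N"); the function name count_failed_logins is hypothetical, and the sample lines are made up using documentation IP ranges, not taken from a real system. Usernames are deliberately discarded so they do not end up in the output.

```python
import re
from collections import Counter

# Assumes OpenSSH-style messages; adjust if your log format differs
FAILED_RE = re.compile(
    r"Failed password for (?:invalid user )?\S+ "
    r"from (\d{1,3}(?:\.\d{1,3}){3}) port \d+"
)


def count_failed_logins(lines) -> Counter:
    """Count failed SSH password attempts per source IP (usernames are dropped)."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines for illustration only
    sample = [
        "sshd[1001]: Failed password for invalid user admin from 203.0.113.5 port 51234 ssh2",
        "sshd[1002]: Failed password for root from 203.0.113.5 port 51240 ssh2",
        "sshd[1003]: Accepted password for student from 192.0.2.10 port 40000 ssh2",
    ]
    for ip, n in count_failed_logins(sample).most_common():
        print(ip, n)
```

The per-IP counts could then be written out in the same src_ip,dst_port,count_per_minute shape as data/connections.csv (with dst_port set to 22) and fed to the anomaly script.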


Stage 3 - Mandatory Output: Tool + Report + Demo

1. Tool (Mandatory)

At minimum, your tool can:

  • accept input (file / text)
  • produce output (label/alert/report)
  • be run in a clearly documented way: python src/...

2. Report (short but solid template)

Create reports/report.md (minimum contents):

  • Problem background
  • Brief threat model (who the attacker is, the target, the impact)
  • System design
  • Dataset & data protection (using GPG? masking?)
  • Test results (sample output, confusion matrix / alert list)
  • Limitations & potential errors
  • Recommendations for improvement

3. Demo (Mandatory)

A 3–5 minute demo:

  • explain the problem
  • run the tool live
  • show the output
  • explain “why” the result came out that way

Grading Rubric

For a high grade, focus on:

  • it works end-to-end (not code fragments)
  • the output includes a risk score + reasons
  • there is an evaluation and a limitations section (where the AI can be wrong)
  • sensitive data is handled: mask/encrypt (GPG)
  • tidy documentation: README + report

Final Checklist (Before Submitting)

  • src/ contains the main script
  • data/ is safe (dummy or GPG-encrypted)
  • models/ contains a model if it is an ML project
  • reports/report.md exists
  • the demo runs on Ubuntu 24.04 with clear commands
  • everything is open-source, nothing proprietary
