latest commits
This commit is contained in:
parent
a877cdc464
commit
67070df04b
@@ -1,49 +1,35 @@
# Wrangling Data: A Reality Check
# The Melding of Data Engineering and "AI"
Okay, let’s be honest. Data wrangling isn’t glamorous. It’s not a sleek, automated process of magically transforming chaos into insights. It’s a messy, frustrating, and surprisingly human endeavor. Let’s break down the usual suspects – the steps we take to get even a vaguely useful dataset, and why they’re often a monumental task.

**(Aussie Perspective on Wrangling Data – Because Let’s Be Honest, It’s a Bit of a Mess)**

**Phase 1: The Hunt**

**(Image: A slightly bewildered-looking person surrounded by spreadsheets and a half-empty coffee cup)**

First, you’re handed a dataset. Let’s call it “Customer_Data_v2”. It’s… somewhere. Maybe a CSV file, maybe a database table, maybe a collection of spreadsheets that haven’t been updated since 2008. Finding it is half the battle. It’s like searching for a decent cup of coffee in Melbourne – you know it’s out there, but it’s often hidden behind a wall of bureaucracy.

Right, let’s be upfront. I’ve spent the last decade-ish wrestling with data. And let me tell you, it’s rarely glamorous. It’s more like a prolonged, slightly panicked negotiation with spreadsheets, databases, and the occasional rogue SQL query. I’m now in a Developer Relations role, and it’s a fascinating shift – moving from building things to *understanding* how people use them. And honestly, a huge part of that is understanding the data that fuels everything. This isn’t about writing elegant code (though that’s still useful!); it’s about bridging the gap between the technical and the… well, the human. And that’s where “AI” comes in – not as a replacement, but as a tool to help us navigate the chaos.
**Phase 2: Deciphering the Ancient Texts**

## The Data Wrangling Process: A Comedy of Errors

Once you *find* it, you start learning what it *means*. This is where things get… interesting. You’re trying to understand what fields represent, what units of measurement are used, and why certain columns have bizarre names (seriously, “Customer_ID_v3”?). How long does that take? Depends on the industry: one week for a small bakery, six months for a multinational insurance company. It’s a wild ride.
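If you’d rather interrogate the file than the humans, a few lines of pandas make a decent first pass. A minimal sketch – the filename and the column names it jokes about are my assumptions, not anything from the actual pipeline:

```python
import pandas as pd

# The mystery file ("Customer_Data_v2", naturally). Filename is hypothetical.
df = pd.read_csv("customer_data_v2.csv")

print(df.head())        # What does it actually look like?
print(df.dtypes)        # Are the "dates" really dates, or strings from 2008?
print(df.nunique())     # Is Customer_ID_v3 actually unique? (Spoiler: rarely.)
print(df.isna().sum())  # How much of each column is simply missing?
```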
Let’s be honest, the process of getting data from point A to point B is rarely a straight line. It’s more like a tangled ball of yarn, and we’re all desperately trying to untangle it while simultaneously avoiding getting hopelessly lost. Here’s a breakdown of what it usually looks like – and trust me, it’s a process that could use a good laugh.

You’ll spend a lot of time trying to understand the business context. “CRM” for Customer Relationship Management? Seriously? It’s a constant stream of jargon and acronyms that make your head spin.

1. **Finding the Data:** This is where the real adventure begins. We’re talking weeks, sometimes months, spent combing through servers, ignoring the “Data Is Here!” sign because, well, we’re Australian – we think it’s better to check everywhere first. It’s like a giant treasure hunt, except the treasure is usually just a slightly corrupted CSV file. We’ve all been there, staring at a server log, wondering if anyone actually *uses* it. It’s a surprisingly common experience.
**Phase 3: The Schema Struggle**

2. **Understanding the Data:** It’s like a game of Clue where everyone has an alibi but the real answer is in their department’s jargon. “KPI,” “MQL,” “Churn Rate” – it’s a beautiful, confusing mess. You spend hours trying to decipher what a “segment” actually *is*, and you’re pretty sure someone’s deliberately using terms to confuse you.

Then there’s the schema. Oh, the schema. It takes a couple of weeks just to learn it. It’s like deciphering ancient hieroglyphics, except instead of predicting the rise and fall of empires, you’re trying to understand why a field called “Customer_ID_v3” exists. It’s a puzzle, and a frustrating one at that.

3. **Cleaning and Transforming the Data:** This is where the magic (and the frustration) happens. We’re talking about removing duplicates, correcting errors, and transforming data into a format that’s actually usable – there’s a rough sketch of what that looks like just after this list.
**Phase 4: The Tooling Tango**

4. **Analyzing the Data:** After months of data cleaning, the analysis itself takes about 10 minutes and we finally get results. Then our boss asks, “Wait, is this for the meeting next week or last month?” Seriously.

You’ll wrestle with the tools. SQL interpreters, data transformation software – they’re all there, but they’re often clunky, outdated, and require a surprising amount of manual effort. It’s the Melbourne coffee problem all over again: the good stuff exists, but you’ll queue behind a wall of bureaucracy to get it.

5. **Reporting the Data:** Who likes reporting? Like, who likes doing the dishes after dinner? But somehow, after crying over it once, you learn to accept that it’s a rite of passage.
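As promised in step 3, here’s roughly what that “magic” looks like in practice. A hedged pandas sketch where the file and every column name (`customer_id_v3`, `signup_date`, `state`) are invented for illustration:

```python
import pandas as pd

# Hypothetical "Customer_Data_v2" export from earlier in the post.
df = pd.read_csv("customer_data_v2.csv")

# Remove exact duplicates - the same customer exported three separate times.
df = df.drop_duplicates()

# Correct obvious errors: coerce dates, turning anything unparseable into NaT.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Normalise a free-text column ("vic", " VIC ", "Victoria"...).
df["state"] = df["state"].str.strip().str.upper().replace({"VICTORIA": "VIC"})

# Transform into something usable downstream: latest record per customer.
clean = (
    df.sort_values("signup_date")
      .groupby("customer_id_v3", as_index=False)
      .last()
)
clean.to_csv("customers_clean.csv", index=False)
```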
**Phase 5: The Reporting Revelation (and Despair)**

## The Rise of "AI" – A Helping Hand (and a Slightly Annoyed Robot)

Finally, you get to the reporting tool. And cry. Seriously, who actually *likes* this part? It’s a soul-crushing exercise in formatting and filtering, and the output is usually something that nobody actually reads.

Now, let’s talk about AI. It’s not going to magically solve all our data problems. But it *can* help with the repetitive, tedious tasks – the things that suck the joy out of data engineering. Think schema discovery, data profiling, and initial data cleaning. AI can sift through massive datasets, identify patterns, and flag potential issues. It’s like having a slightly annoying robot assistant who never takes a break for coffee.
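To make “schema discovery with AI” concrete: the same local-model setup this blog is generated with can be pointed at a schema and asked for a first-pass read. A sketch using the `ollama` Python client – the host, model, and schema string are all assumptions for illustration:

```python
from ollama import Client

client = Client(host="http://localhost:11434")  # assumed local Ollama instance

# Hypothetical schema pulled from the warehouse - yours will differ.
schema = "customer_id_v3 INT, signup_date TEXT, state TEXT, churn_flag INT"

response = client.chat(
    model="gemma3:latest",  # any chat-capable local model will do
    messages=[{
        "role": "user",
        "content": "Describe each column in this table and flag anything "
                   f"that looks mis-typed or suspicious: {schema}",
    }],
)
print(response["message"]["content"])
```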
**The AI Factor – A Realistic Perspective**

Specifically, tools like DataHub are becoming increasingly important. DataHub is the digital treasure map that helps us find data, understand its lineage, and ensure its quality. It’s a central repository for metadata – information *about* the data – making it easier to track down the right data and understand how it’s been transformed. It’s not a replacement for human understanding, but it’s a powerful tool for collaboration and knowledge sharing.
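If you want to see how small the on-ramp is, pushing a description into DataHub from Python takes a handful of lines with the `acryl-datahub` package. A minimal sketch, assuming a local DataHub instance and an entirely made-up dataset name:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Assumed local DataHub GMS endpoint.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Describe the (hypothetical) cleaned customer table so the next person
# doesn't have to rediscover what "Customer_ID_v3" means.
dataset_urn = make_dataset_urn(platform="postgres", name="crm.customers_clean")
properties = DatasetPropertiesClass(
    description="Deduplicated customer table; one row per customer_id_v3.",
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```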
Now, everyone’s talking about AI. And, look, I’m not saying AI is a bad thing. It’s got potential. But let’s be realistic: for quite some time yet, this is the point where we’ll still need people. AI can automate the process of extracting data from a spreadsheet. But it can’t understand *why* that spreadsheet was created in the first place. It can’t understand the context, the assumptions, the biases. It can’t tell you if the data is actually useful.

## The Human Element – Still Absolutely Crucial

We can use tools like DataHub to capture some of this business knowledge, but those tools are only as good as the people who use them. We should let AI handle the uniform parts – schema discovery, finding the tools, ugh, reporting. But where the rubber hits the road, that’s where we need people, and we need to make sure there’s a person interpreting not only what goes out, but what goes in.
Here’s the thing: AI can’t understand sarcasm. It can’t interpret the nuances of a business context. It can’t tell you whether a particular metric is actually *meaningful*. That’s where we come in. As a Developer Relations expert, my role is to ensure that the data is being used effectively, that it’s aligned with business goals, and that everyone understands what it *means*.

**The Bottom Line**

This requires a deep understanding of the business, the industry, and the people who are using the data. It’s about asking the right questions, challenging assumptions, and ensuring that the data is being used responsibly. It’s about connecting the dots between the technical and the human.
## The Future of Data Engineering – A Balancing Act

So, what does the future hold? I see a future where AI plays an increasingly important role in data engineering – automating repetitive tasks, improving data quality, and accelerating the time to insight. But I also see a continued need for human expertise. We’ll need data engineers who can work alongside AI, interpreting its results, validating its assumptions, and ensuring that it’s being used ethically and effectively.

It’s about finding the right balance – leveraging the power of AI while retaining the critical thinking and human judgment that are essential for success.

## Conclusion – Data is a Collaborative Effort

Ultimately, data engineering is a collaborative effort. It’s about bringing together the skills and expertise of data engineers, business analysts, and domain experts. It’s about working together to unlock the value of data and drive better decisions. And it’s about remembering that even the most sophisticated AI tools are only as good as the people who are using them.

Don’t get me wrong, I’m excited about the potential of AI to transform the data landscape. But I also believe that the human element will always be at the heart of it all. Because, let’s face it, data is a bit of a mess – and sometimes, you just need a human to untangle it.

It’s a bit like trying to build a great BBQ. You can buy the fanciest gadgets and the most expensive wood, but if you don’t know how to cook, you’re going to end up with a burnt mess. So, let’s not get carried away with the hype. Let’s focus on building a data culture that values human intelligence, critical thinking, and a good dose of common sense. And let’s keep wrangling. Because, let’s be honest, someone’s gotta do it.
@@ -5,11 +5,7 @@ from langchain_ollama import ChatOllama
class OllamaGenerator:

<<<<<<< HEAD
    def __init__(self, title: str, content: str, inner_title: str):
=======
    def __init__(self, title: str, content: str, model: str, inner_title: str):
>>>>>>> 6313752 (getting gemma3 in the mix)
        self.title = title
        self.inner_title = inner_title
        self.content = content
@@ -17,25 +13,12 @@ class OllamaGenerator:
        self.chroma = chromadb.HttpClient(host="172.18.0.2", port=8000)
        # Single quotes inside the f-string keep this valid on Python < 3.12.
        ollama_url = f"{os.environ['OLLAMA_PROTOCOL']}://{os.environ['OLLAMA_HOST']}:{os.environ['OLLAMA_PORT']}"
        self.ollama_client = Client(host=ollama_url)
<<<<<<< HEAD
        self.ollama_model = os.environ["EDITOR_MODEL"]
        self.embed_model = os.environ["EMBEDDING_MODEL"]
        self.agent_models = json.loads(os.environ["CONTENT_CREATOR_MODELS"])
        self.llm = ChatOllama(model=self.ollama_model, temperature=0.6, top_p=0.5)  # This is the level head in the room
        self.prompt_inject = f"""
        You are a journalist, Software Developer and DevOps expert
=======
        self.ollama_model = model
        self.embed_model = "snowflake-arctic-embed2:latest"
        self.agent_models = ["openthinker:7b", "deepseek-r1:7b", "qwen2.5:7b", "gemma3:latest"]
        self.prompt_inject = f"""
<<<<<<< HEAD
        You are a journalist, Software Developer and DevOps expert
        who has transitioned into Developer Relations
=======
        You are a journalist, Software Developer and DevOps expert
>>>>>>> e57d6eb (getting gemma3 in the mix)
>>>>>>> 6313752 (getting gemma3 in the mix)
        writing a 1000 word draft blog for other tech enthusiasts.
        You like to use almost no code examples and prefer to talk
        in a light comedic tone. You are also Australian
@@ -129,21 +112,9 @@ class OllamaGenerator:
    def generate_markdown(self) -> str:

<<<<<<< HEAD
        prompt_system = f"""
        You are an editor taking information from {len(self.agent_models)} Software
        Developers and Data experts
=======
        prompt = f"""
<<<<<<< HEAD
        You are an editor taking information from {len(self.agent_models)} Software
        Developers and Data experts
        who have transitioned into Developer Relations
=======
        You are an editor taking information from {len(self.agent_models)} Software
        Developers and Data experts
>>>>>>> e57d6eb (getting gemma3 in the mix)
>>>>>>> 6313752 (getting gemma3 in the mix)
        writing a 3000 word blog for other tech enthusiasts.
        You like when they use almost no code examples and the
        voice is in a light comedic tone. You are also Australian
@@ -29,6 +29,7 @@ def try_something(test):
blog_repo = "/path/to/your/blog/repo"
>>>>>>> d35a456 (set up chroma)

<<<<<<< HEAD
        if os.path.exists(repo_path):
            shutil.rmtree(repo_path)
        self.repo_path = repo_path
@@ -36,6 +37,19 @@ blog_repo = "/path/to/your/blog/repo"
        self.repo = Repo(repo_path)
        self.username = username
        self.password = password
=======

        # Check out a new branch and create a new file for our blog post
        branch_name = "new-post"
        try:
            repo = Git(blog_repo)  # GitPython command wrapper over the repo dir
            repo.checkout("-b", branch_name, "origin/main")
            # Write the post inside the repo, not the current working directory
            with open(os.path.join(blog_repo, "my-blog-post.md"), "w") as f:
                f.write(content)
        except (InvalidGitRepositoryError, GitCommandError):
            # Handle repository errors gracefully; the command wrapper raises
            # GitCommandError when the checkout itself fails
            pass
>>>>>>> 8575918 (latest commits)
    def clone(self, remote_url, destination_path):
        """Clone a Git repository with authentication"""